Abstract. From social science to biology, numerous applications rely on graphlets
for intuitive and meaningful characterization of networks at both the global macro-level
and the local micro-level. While graphlets have witnessed tremendous success
and impact in a variety of domains, there has yet to be a fast and efficient approach
for computing the frequencies of these subgraph patterns. Existing methods
do not scale to large networks with millions of nodes and edges, which impedes the
application of graphlets to new problems that require large-scale network analysis. To
address these problems, we propose a fast, efficient, and parallel algorithm for counting
graphlets of size k = {3,4} nodes that takes only a fraction of the time required
by current methods. The proposed graphlet counting algorithm
leverages a number of proven combinatorial arguments for different graphlets.
For each edge, we count a few graphlets, and with these counts along with the
combinatorial arguments, we obtain the exact counts of the others in constant time. On a large
collection of 300+ networks from a variety of domains, our graphlet counting strategies
are on average 460x faster than current methods. This brings new opportunities to
investigate the use of graphlets on much larger networks and newer applications, as we
show in the experiments. To the best of our knowledge, this paper provides the largest
graphlet computations to date as well as the largest systematic investigation on over
300 networks from a variety of domains.

Keywords: Graphlet; Motif; Graph Mining; Graph Kernel; Classification; Graph Features;
Higher-order Graph Statistics; Biological Networks; Visual Graph Analytics


1. Introduction 


Recursive decomposition of networks is a widely used approach in network analysis
to factorize the complex structure of real-world networks into small subgraph
patterns of size k nodes. These patterns are called graphlets (Przulj, Cornell
and Jurisica, 2004). Graphlets (also known as motifs (Milo, Shen-Orr, Itzkovitz,
Kashtan, Chklovskii and Alon, 2002)) are defined as subgraph patterns recurring
in real-world networks at frequencies that differ statistically significantly from
those in random networks. Given a network, we can count the number of embeddings
of each graphlet in the network, creating a profile of sufficient statistics
that characterizes the network structure (Shervashidze, Petri, Mehlhorn,
Borgwardt and Vishwanathan, 2009). While knowing the graphlet frequencies does
not uniquely define the network structure, it has been shown that graphlet
frequencies often carry significant information about the local network structure in
a variety of domains (Holland and Leinhardt, 1976; Faust, 2010; Frank, 1988).
This is in contrast to global topological properties (e.g., diameter, degree
distribution), where networks with similar or identical global topological properties can
exhibit significantly different local structures.


1.1. Graphlets, Scalability, & Applications 

From social science to biology, graphlets have found numerous applications and
have been used as the building blocks of network analysis (Milo et al., 2002). In
social science, graphlet analysis (typically known as k-subgraph census) is widely
adopted in sociometric studies (Holland and Leinhardt, 1976; Frank, 1988).
Much of the work in this vein focused on analyzing triadic tendencies as
important structural features of social networks (e.g., transitivity or triadic
closure) as well as analyzing triadic configurations as the basis for various social
network theories (e.g., social balance, strength of weak ties, stability of ties, or
trust (Granovetter, 1983)). In biology (Przulj et al., 2004; Milenković and Przulj,
2008), graphlets were widely used for protein function prediction (Shervashidze
et al., 2009), network alignment (Milenković, Ng, Hayes and Przulj, 2010), and
phylogeny (Kuchaiev, Milenković, Memisevic, Hayes and Przulj, 2010), to name
a few. More recently, there has been an increased interest in exploring the role of
graphlet analysis in computer networking (Feldman and Shavitt, 2008; Hales and
Arteconi, 2008; Becchetti, Boldi, Castillo and Gionis, 2008) (e.g., for web spam
detection, analysis of peer-to-peer protocols and Internet AS graphs),
chemoinformatics (Ralaivola, Swamidass, Saigo and Baldi, 2005; Kashima, Saigo,
Hattori and Tsuda, 2010), and image segmentation (Zhang, Song, Liu, Liu, Bu and
Chen, 2013), among others (Zhang, Han, Yang, Song, Yan and Tian, 2013).

While graphlet counting and discovery have witnessed tremendous success
and impact in a variety of domains from social science to biology, there
has yet to be a fast and efficient approach for computing the frequencies of
these patterns. For instance, the method of Shervashidze et al. (2009) takes
hours to count graphlets on relatively small biological networks (i.e., a few
hundred or thousand nodes and edges) and uses such counts as features for graph
classification (Vishwanathan, Schraudolph, Kondor and Borgwardt, 2010). Previous
work showed that graphlet counting is computationally intensive since the
number of possible k-subgraphs in a graph G increases exponentially with k in
$O(|V|^k)$ and can be computed in $O(|V| \cdot \Delta^{k-1})$ for any bounded degree graph,
where $\Delta$ is the maximum degree of the graph (Shervashidze et al., 2009).

To address these problems, we propose a fast, efficient, and parallel algorithm
for counting graphlets of size k = {3,4} nodes that takes only a fraction
of the time required by current methods. The
proposed graphlet counting algorithm leverages a number of proven combinatorial
arguments for different graphlets. For each edge, we count a few graphlets,
and with these counts along with the combinatorial arguments, we obtain the
exact counts of the others in constant time. On a large collection of 300+ networks
from a variety of domains, our graphlet counting strategies are on average 460x
faster than current methods. This brings new opportunities to investigate the
use of graphlets on much larger networks and newer applications, as we show in
our experiments. To the best of our knowledge, this paper provides the largest
graphlet computations to date as well as the largest systematic investigation on
over 300 networks.

Furthermore, a number of important machine learning tasks are likely to 
benefit from such an approach, including graph anomaly detection (Noble and 
Cook, 2003), as well as using graphlets as features for improving community 
detection (Schaeffer, 2007), role discovery (Rossi and Ahmed, 2015b), graph
classification (Vishwanathan et al., 2010), and relational learning (Getoor and
Taskar, 2007). 

We test the scalability of our proposed approach experimentally on 300+
networks from a variety of domains, such as biological, social, and technological
domains. We compare our approach to state-of-the-art exact counting methods
such as RAGE (Marcus and Shavitt, 2012), FANMOD (Wernicke and Rasche,
2006), and Orca (Hocevar and Demsar, 2014). We found that RAGE (Marcus
and Shavitt, 2012) took 2400 seconds to count graphlets on a small 26k node
graph, whereas our proposed method is 460x faster, taking only 0.01 seconds.
We also note that FANMOD (Wernicke and Rasche, 2006), another recent
approach, takes 172800 seconds, and Orca (Hocevar and Demsar, 2014) takes 2.5
seconds for the same small graph. Our exact graphlet analysis is well-suited for
shared-memory multi-core architectures (CPU and GPU), distributed
architectures (MPI), and hybrid implementations that leverage the advantages of both.


1.2. Contributions 

• Algorithms. A fast, efficient, and parallel graphlet counting algorithm that 
leverages a number of combinatorial arguments that we show for different 
graphlets. The combinatorial arguments we show in this paper enable us to 
obtain significant improvement on the scalability of graphlet counting. 

• Scalability. The proposed graphlet counting algorithm achieves on average a
460x runtime improvement over the state-of-the-art methods. In addition, we
analyze graphlet counts on graphs of sizes that are beyond the scope of the
state-of-the-art (e.g., on graphs with a hundred million nodes and a billion edges).

• Effectiveness. Largest graphlet computations to date and largest systematic
evaluation on over 300 large-scale networks from a variety of domains.

• Applications. We systematically investigate a variety of existing and new 
applications for graphlet counting, such as finding unique patterns in graphs, 
graph similarity, and graph classification. 


2. Background 

Graphlets are subgraph patterns recurring in real-world networks at frequencies
that are significantly higher than those in random networks (Milo et al., 2002;
Przulj et al., 2004). Previous work showed that graphlets can be used to define
universal classes of networks (Milo et al., 2002). Moreover, graphlets are at the
heart and foundation of many network analysis tasks (e.g., network classification,
network alignment, etc.) (Przulj et al., 2004; Milenković and Przulj, 2008; Hayes,
Sun and Przulj, 2013). In this paper, we introduce an efficient algorithm to
compute the number of embeddings of each graphlet of size k = {2,3,4} nodes in
the network (see Table 1 for notation).

Table 1. Summary of graphlet notation

Summary of the notation and properties for the graphlets of size k = {2,3,4}. Note that $\rho$ denotes
density, $\Delta$ and $\bar{d}$ denote the max and mean degree, whereas assortativity is denoted by r. Also, |T|
denotes the total number of triangles, K is the max k-core number, $\chi$ denotes the chromatic number,
whereas D denotes the diameter, B denotes the max betweenness, and |C| denotes the number of
components. Note that if |C| > 1, then r, D, and B are from the largest component.


2.1. Notation and Definitions 

Given an undirected simple input graph G = (V, E), a graphlet of size k nodes
is defined as any subgraph $G_k \subset G$ which consists of a subset of k nodes of
the graph G. In this paper, we mainly focus on computing the frequencies of
induced graphlets. An induced graphlet is an induced subgraph that consists of
all edges between its nodes that are present in the input graph (as described
in Definition 1). In addition, we distinguish between connected and disconnected
graphlets (see Table 1). A graphlet is connected if there is a path from any
node to any other node in the graphlet (see Definition 2). Table 1 provides a
summary of the notation and properties of all possible induced graphlets of size
k = {2,3,4}.

Definition 1. Induced Graphlet: an induced graphlet $G_k = (V_k, E_k)$ is a
subgraph that consists of a subset of k vertices of the graph G = (V, E) (i.e.,
$V_k \subset V$) together with all the edges whose endpoints are both in this subset
(i.e., $E_k = \{\forall e \in E \mid e = (u, v) \wedge u, v \in V_k\}$).

Definition 2. Connected Graphlet: a graphlet $G_k = (V_k, E_k)$ is connected when
there is a path from any node to any other node in the graphlet (i.e., $\forall u, v \in V_k$,
$\exists P_{u-v} : u, \ldots, w, \ldots, v$, such that $d(u, v) \geq 0 \wedge d(u, v) \neq \infty$). By definition, there
exists one and only one connected component in a graphlet $G_k$ (i.e., |C| = 1) if
and only if $G_k$ is connected.
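
Definitions 1 and 2 can be illustrated with a short Python sketch. The function names (`build_adjacency`, `induced_graphlet`, `is_connected`), the adjacency-dictionary representation, and the toy graph are assumptions made for this illustration and are not part of the proposed algorithm.

```python
from collections import deque

def build_adjacency(edges):
    """Build an undirected adjacency dict {node: set(neighbors)} from an edge list."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj

def induced_graphlet(adj, nodes):
    """Return the edge set E_k induced by the node subset V_k (Definition 1)."""
    nodes = set(nodes)
    return {frozenset((u, v)) for u in nodes for v in adj.get(u, ()) if v in nodes}

def is_connected(nodes, induced_edges):
    """Check Definition 2 with a breadth-first search restricted to the graphlet."""
    nodes = set(nodes)
    if not nodes:
        return False
    local = {u: set() for u in nodes}
    for edge in induced_edges:
        u, v = tuple(edge)
        local[u].add(v)
        local[v].add(u)
    start = next(iter(nodes))
    seen, queue = {start}, deque([start])
    while queue:
        x = queue.popleft()
        for y in local[x] - seen:
            seen.add(y)
            queue.append(y)
    return seen == nodes

# Hypothetical toy graph: a triangle {0, 1, 2}, a pendant node 3, and a separate edge (4, 5).
EDGES = [(0, 1), (1, 2), (0, 2), (2, 3), (4, 5)]
ADJ = build_adjacency(EDGES)
```

For instance, the node subset {0, 1, 2} induces a connected graphlet (a triangle), whereas {0, 1, 4} induces a disconnected one with a single edge.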


Problem Definition. Given a family of graphlets of size k nodes $\mathcal{G}_k =
\{g_{k_1}, g_{k_2}, \ldots, g_{k_m}\}$, our goal is to count the number of embeddings
(appearances) of each graphlet $g_{k_i} \in \mathcal{G}_k$ in the input graph G. In other words, we
need to count the number of induced graphlets $G_k$ in G that are isomorphic
to each graphlet $g_{k_i} \in \mathcal{G}_k$ in the family; this number is denoted by $\binom{G}{g_{k_i}}$
(Gross, Yellen and Zhang, 2013).


A graphlet $g_{k_i} \in \mathcal{G}_k$ is embedded in the graph G if and only if there is
an injective mapping $\sigma : V_{g_{k_i}} \rightarrow V$, with $e = (u, v) \in E_{g_{k_i}}$ if and only if
$e' = (\sigma(u), \sigma(v)) \in E$. Table 1 shows that $|\mathcal{G}_k| = \{2, 4, 11\}$ when k = {2,3,4}
respectively. Further, given a family $\mathcal{G}_k = \{g_{k_1}, \ldots, g_{k_m}\}$ of graphlets of size
k nodes, we define $f(g_{k_i}, G)$ as the relative frequency of any graphlet $g_{k_i} \in \mathcal{G}_k$
in the input graph G.
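
The family sizes {2, 4, 11} quoted above can be checked by brute force for small k. The following Python sketch is only an illustration (the function name `count_graphlet_classes` is hypothetical): it enumerates every labeled simple graph on k nodes and groups them by a canonical form obtained by trying all relabelings, which is feasible only for very small k.

```python
from itertools import combinations, permutations

def count_graphlet_classes(k):
    """Count non-isomorphic simple undirected graphs on k nodes by brute force."""
    pairs = list(combinations(range(k), 2))
    perms = list(permutations(range(k)))
    canonical_forms = set()
    for mask in range(1 << len(pairs)):
        edges = [pairs[i] for i in range(len(pairs)) if (mask >> i) & 1]
        # Canonical form: the lexicographically smallest relabeled, sorted edge list.
        best = min(
            tuple(sorted(tuple(sorted((p[u], p[v]))) for u, v in edges))
            for p in perms
        )
        canonical_forms.add(best)
    return len(canonical_forms)
```

Calling `count_graphlet_classes(k)` for k = 2, 3, 4 counts connected and disconnected patterns together, matching the way the graphlet families are defined here.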


2.2. Relationship to Graph Complement 

The complement of a graph G, denoted by $\bar{G}$, is the graph defined on the same
vertices as G such that two vertices are connected in $\bar{G}$ if and only if they are not
connected in G. Therefore, the graph sum $G + \bar{G}$ gives the complete graph on
the set of vertices of G. There are direct relationships between the frequencies
of graphlets and the frequencies of their complements. For each graphlet $g_{k_i}$,
there exists a complementary graphlet pattern $\bar{g}_{k_i}$ (non-isomorphic to $g_{k_i}$ unless
the pattern is self-complementary), such that two
vertices are connected in $\bar{g}_{k_i}$ if and only if they are not connected in $g_{k_i}$ (Gross
et al., 2013). For example, cliques and independent sets of size k nodes are
pairs of complementary graphlets. Similarly, chordal cycles of size 4 nodes are
complementary to the 4-node-1-edge graphlet (see Table 1). It is also worth noting
that the 4-path graphlet is a self-complementary pattern, which means the 4-path
is isomorphic to its own complement. From this discussion, it is clear that the number
of embeddings of each graphlet $g_{k_i} \in \mathcal{G}_k$ in the input graph G is equivalent to
the number of embeddings of its complementary graphlet $\bar{g}_{k_i}$ in the complement
graph $\bar{G}$. In other words, $f(g_{k_i}, G) = f(\bar{g}_{k_i}, \bar{G})$ (Gross et al., 2013).
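
The complement relationship can be checked numerically on a small example. The sketch below is an illustration under stated assumptions: the helper names are hypothetical, and it relies on the fact that for k = 3 the number of induced edges (0, 1, 2, or 3) identifies the graphlet, so complementation maps m edges to 3 - m edges. This shortcut does not carry over to k = 4, where several graphlets share the same edge count.

```python
from itertools import combinations

def complement(nodes, edges):
    """Return the edge set of the complement graph on the same node set."""
    present = {frozenset(e) for e in edges}
    return {frozenset(p) for p in combinations(nodes, 2)} - present

def count_induced_by_edge_number(nodes, edges, k, m):
    """Count k-node subsets whose induced subgraph has exactly m edges."""
    present = {frozenset(e) for e in edges}
    return sum(
        1
        for subset in combinations(nodes, k)
        if sum(frozenset(p) in present for p in combinations(subset, 2)) == m
    )

# Hypothetical toy graph: a triangle {0, 1, 2}, a pendant node 3, and an isolated node 4.
NODES = range(5)
EDGES = [(0, 1), (1, 2), (0, 2), (2, 3)]
```

On this toy graph, the number of triangles in G equals the number of independent 3-node sets in the complement, and likewise for the 2-star and 3-node-1-edge pair.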


2.3. Relationship to Graph/Matrix Reconstruction Theorems 

The graph reconstruction conjecture (Gross et al., 2013) states that an undirected
graph G can be uniquely determined, up to an isomorphism, from the
set of all possible vertex-deleted subgraphs of G (i.e., $\{G_v\}_{v \in V}$) (McKay, 1997).
Verification of this conjecture for all possible graphs up to 6 vertices was carried
out by Kelly (Kelly, 1957), and was later extended to up to 11 vertices by
McKay (McKay, 1997). Clearly, if two graphs are isomorphic (i.e., $G \cong G'$), then
their graphlet frequencies would be the same (i.e., $f_k(G) = f_k(G')$), but the
reverse remains a conjecture for the general case of graphs. In contrast, the matrix
reconstruction theorem has been resolved (Manvel and Stockmeyer, 1971). It
states that any N x N matrix can be reconstructed from its list of all
possible principal minors obtained by the deletion of the k-th row and the k-th
column (Manvel and Stockmeyer, 1971), and it is the foundation of a class of
graph kernels called the graphlet kernel (Shervashidze et al., 2009).


2.4. Related Work 

In this section, we briefly discuss some of the related work, highlighting various
graph mining and machine learning tasks that would benefit from our approach.
Much of the previous work focused on counting certain types of graphlets
(e.g., only connected graphlets such as cliques and cycles) (Kloks, Kratsch and
Muller, 2000; Wernicke and Rasche, 2006; Hocevar and Demsar, 2014). However,
a number of graph mining and machine learning tasks rely on counting all
graphlets of a certain size.

For example, some previous work used the full spectrum of graphlet frequencies
to define a domain-independent coordinate system in which collections
of graphs can be compactly represented and analyzed within a common
space (Ugander, Backstrom and Kleinberg, 2013). Moreover, a variety of graph
kernels have been proposed in machine learning (e.g., graphlet, subtree, and random
walk kernels) (Vishwanathan et al., 2010; Costa and De Grave, 2010; Shervashidze
et al., 2009) to bridge the gap between graph learning and kernel methods.
Some types of graph kernels, in particular the graphlet kernel, rely
on counting all graphlets. However, a general limitation of most graph kernels
(including the graphlet kernel) is that they scale poorly to large graphs with
more than a few hundred or thousand nodes (Vishwanathan et al., 2010). Thus,
our fast algorithms would speed up the computations of these methods and their
related applications in graph modeling, similarity, and comparisons.

Recently, there has been an increased interest in sampling and other heuristic
approaches for obtaining approximate counts of various graphlets (Bhuiyan,
Rahman, Rahman and Al Hasan, 2012; Gonen and Shavitt, 2009). However, our
approach focuses on exact graphlet counting and thus sampling methods are
outside the scope of this paper. Nevertheless, the analysis and combinatorial
arguments we show in this paper can be used along with efficient sampling methods
to provide more accurate and efficient approximations.

In addition, the aim and scope of this paper is different from the aforementioned
problem of graph reconstruction. While graph reconstruction tries to test
for the notion of isomorphism and structure equivalence between graphs, our
goal is to relax the notion of equivalence to some form of structural similarity
between graphs, such that the graph similarity is measured using the feature
representation of graphlets.


3. Framework 

In this section, we describe our approach for graphlet counting, which takes only a
fraction of the time required by current methods.
We introduce and prove a number of combinatorial arguments for different
graphlets. The proposed graphlet counting algorithm leverages these combinatorial
arguments to obtain significant improvement on the scalability of graphlet
counting. For each edge, we count only a few graphlets, and with these counts
along with the combinatorial arguments, we derive the exact counts of the others
in constant time.


3.1. Searching Edge Neighborhoods 

Our proposed algorithm iterates over all the edges of the input graph G = (V, E).
For each edge $e = (u, v) \in E$, we define the neighborhood of an edge e, denoted
by $\mathcal{N}(e)$, as the set of all nodes that are connected to the endpoints of e,
i.e., $\mathcal{N}(e) = \{\mathcal{N}(u) \setminus \{v\}\} \cup \{\mathcal{N}(v) \setminus \{u\}\}$, where $\mathcal{N}(u)$ and $\mathcal{N}(v)$ are the sets of
neighbors of u and v respectively. Given a single edge $e = (u, v) \in E$, we explore
the subgraph surrounding this edge, i.e., the subgraph induced by both its
endpoints and the nodes in its neighborhood. We call this subgraph the egonet
of the edge e, where e is the center (ego) of the subgraph.
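
As a minimal sketch of these definitions, assuming the graph is stored as a dictionary of neighbor sets, the edge neighborhood and the egonet of an edge can be computed as follows. The function names and the toy graph are hypothetical and serve only as an illustration.

```python
def edge_neighborhood(adj, u, v):
    """N(e): neighbors of u other than v, together with neighbors of v other than u."""
    return (adj[u] - {v}) | (adj[v] - {u})

def edge_egonet(adj, u, v):
    """Return (nodes, edges) of the subgraph induced by u, v, and N(e)."""
    nodes = edge_neighborhood(adj, u, v) | {u, v}
    edges = {frozenset((x, y)) for x in nodes for y in adj[x] if y in nodes}
    return nodes, edges

# Hypothetical toy graph in adjacency-set form.
ADJ = {
    0: {1, 2},
    1: {0, 2, 3},
    2: {0, 1},
    3: {1, 4},
    4: {3},
}
```

For the edge (0, 1) in this toy graph, the neighborhood is {2, 3}; node 4 lies outside the egonet, so the edge (3, 4) is never examined when processing (0, 1).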

We search for possible graphlet patterns of size k = {3,4} in the egonets of
all edges in the graph. By searching egonets of edges, we first map the problem to
the local (lower-dimensional) space induced by the neighborhood of each edge,
and then merge the search results for all edges. Searching over a local
low-dimensional space of edge neighborhoods is clearly more efficient than searching
over the global high-dimensional space of the whole graph. Moreover, searching
over a local low-dimensional space of edge neighborhoods is amenable to parallel
implementation, which offers additional speedup over iterative methods. Note
that an exhaustive search of the egonet of any edge $e \in E$ has a cost that grows
polynomially with $\Delta$, where $\Delta$ is the maximum degree in G. Clearly, exhaustive search
is computationally intensive for large graphs, and our approach is more efficient,
as we will show next.


3.2. Counting Graphlets of Size (k = 3) Nodes

Algorithm 1 (TriadCensus) shows how to count graphlets of size k = 3 for
each edge. There are four possible graphlets of size k = 3 nodes, where only
$g_{3_1}$ (i.e., triangle patterns) and $g_{3_2}$ (i.e., 2-star patterns) are connected graphlets
(see Table 1).
Algorithm 1 Our exact triad census algorithm for counting all 3-node graphlets. The
algorithm takes an undirected graph as input and returns the frequencies of all 3-node graphlets
$f(\mathcal{G}_3, G)$.

 1: procedure TriadCensus($G = (V, E)$)
 2:     Initialize Array X
 3:     parallel for $e = (u, v) \in E$ do
 4:         $\mathrm{Star}_u = \emptyset$, $\mathrm{Star}_v = \emptyset$, $\mathrm{Tri}_e = \emptyset$
 5:         for $w \in \mathcal{N}(u)$ do
 6:             if $w = v$ then continue
 7:             Add w to $\mathrm{Star}_u$ and set $X(w) = 1$
 8:         for $w \in \mathcal{N}(v)$ do
 9:             if $w = u$ then continue
10:             if $X(w) = 1$ then                ▷ found triangle
11:                 Add w to $\mathrm{Tri}_e$
12:                 Remove w from $\mathrm{Star}_u$
13:             else Add w to $\mathrm{Star}_v$
14:         $f(g_{3_1}, G)$ += $|\mathrm{Tri}_e|$
15:         $f(g_{3_2}, G)$ += $|\mathrm{Star}_u| + |\mathrm{Star}_v|$
16:         $f(g_{3_3}, G)$ += $|V| - |\mathcal{N}(u) \cup \mathcal{N}(v)|$
17:         for $w \in \mathcal{N}(u)$ do $X(w) = 0$
18:     end parallel
19:     $f(g_{3_1}, G) = 1/3 \cdot f(g_{3_1}, G)$
20:     $f(g_{3_2}, G) = 1/2 \cdot f(g_{3_2}, G)$
21:     $f(g_{3_4}, G) = \binom{|V|}{3} - f(g_{3_1}, G) - f(g_{3_2}, G) - f(g_{3_3}, G)$
22:     return $f(\mathcal{G}_3, G)$
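
A sequential Python sketch of Algorithm 1 may help make the bookkeeping concrete. It is an illustration under several assumptions: the graph is an adjacency dictionary in which every node (including isolated ones) appears as a key, set operations replace the marker array X, the parallel loop is run sequentially, and the function name `triad_census` and the output keys are hypothetical.

```python
from math import comb

def triad_census(adj):
    """Sequential sketch of Algorithm 1; returns counts of the four 3-node graphlets."""
    n = len(adj)
    tri_sum = star_sum = one_edge = 0
    edges = {frozenset((u, v)) for u in adj for v in adj[u]}
    for edge in edges:
        u, v = tuple(edge)
        tri_e = adj[u] & adj[v]                 # nodes closing a triangle with e
        star_u = adj[u] - adj[v] - {v}          # 2-stars centered at u
        star_v = adj[v] - adj[u] - {u}          # 2-stars centered at v
        tri_sum += len(tri_e)
        star_sum += len(star_u) + len(star_v)
        one_edge += n - len(adj[u] | adj[v])    # nodes adjacent to neither endpoint
    triangles = tri_sum // 3                    # each triangle is seen from 3 edges
    two_stars = star_sum // 2                   # each 2-star is seen from 2 edges
    independent = comb(n, 3) - triangles - two_stars - one_edge
    return {
        "triangle": triangles,
        "2-star": two_stars,
        "3-node-1-edge": one_edge,
        "3-node-independent": independent,
    }
```

On a toy graph made of a triangle {0, 1, 2}, a pendant node 3 attached to node 2, and an isolated node 4, the four counts sum to the 10 possible 3-node subsets.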


Connected graphlets of size k = 3. 

Lines 5-13 of Algorithm 1 show how to find and count triangles incident to
an edge. For any edge e = (u, v), a triangle (u, v, w) exists if and only if w is
connected to both u and v. Let $\mathrm{Tri}_e$ be the set of all nodes that form a triangle
with e = (u, v), and $|\mathrm{Tri}_e|$ be the number of such triangles. Then, $\mathrm{Tri}_e$ is the set
of overlapping nodes in the neighborhoods of u and v, i.e., $\mathrm{Tri}_e = \mathcal{N}(u) \cap \mathcal{N}(v)$.
Note that Algorithm 1 counts each triangle three times (one time for each edge
in the triangle), and therefore we divide the total count by 3 as in Equation (1),

$$f(g_{3_1}, G) = \frac{1}{3} \sum_{e=(u,v) \in E} |\mathrm{Tri}_e| \tag{1}$$

Now we need to count 2-star patterns (i.e., $g_{3_2}$). For any edge e = (u, v),
let $\mathrm{Star}_e$ be the set of all nodes that form a 2-star with e, and $|\mathrm{Star}_e|$ be the
number of such star patterns. A 2-star pattern (u, v, w) exists if and only if w is
connected to either u or v but not both. Accordingly, $\mathrm{Star}_e = \mathrm{Star}_u \cup \mathrm{Star}_v$, where
$\mathrm{Star}_u$ and $\mathrm{Star}_v$ are the sets of nodes that form a 2-star with e centered at u and v
respectively. More formally, $\mathrm{Star}_u$ can be defined as $\mathrm{Star}_u = \{w \in \mathcal{N}(u) \setminus \{v\} \mid w \notin \mathcal{N}(v)\}$,
and $\mathrm{Star}_v$ can be defined as $\mathrm{Star}_v = \{w \in \mathcal{N}(v) \setminus \{u\} \mid w \notin \mathcal{N}(u)\}$.

Similar to counting triangles, Algorithm 1 counts each 2-star pattern two
times (one time for each edge in the 2-star). Thus, we divide the sum for all
edges by 2 as follows,

$$f(g_{3_2}, G) = \frac{1}{2} \sum_{e=(u,v) \in E} \left( |\mathrm{Star}_u| + |\mathrm{Star}_v| \right) \tag{2}$$

Disconnected graphlets of size k = 3. 

There are two disconnected graphlets of size k = 3 nodes, $g_{3_3}$ (i.e., the 3-node-1-edge
pattern) and $g_{3_4}$ (i.e., the independent set defined on 3 nodes) (see Table 1).
Lines 16 and 21 show how to count these patterns.

Equation (3) shows that the number of 3-node-1-edge graphlets per edge e is
equivalent to the number of all nodes that are not in the neighborhood subgraph
(egonet) of edge e (i.e., $V \setminus \{\mathcal{N}(u) \cup \mathcal{N}(v)\}$),

$$f(g_{3_3}, G) = \sum_{e=(u,v) \in E} \left( |V| - |\mathcal{N}(u) \cup \mathcal{N}(v)| \right) \tag{3}$$

where $|\mathcal{N}(u) \cup \mathcal{N}(v)| = |\mathrm{Tri}_e| + |\mathrm{Star}_e| + |\{u, v\}|$. Note that the number of
3-node-1-edge graphlets can be computed in $O(1)$ for each edge.

Given that the total number of graphlets of size 3 nodes is $\binom{|V|}{3}$, Equation (4)
shows how to compute the frequency of $g_{3_4}$, which clearly can be done in $O(1)$,

$$f(g_{3_4}, G) = \binom{|V|}{3} - \left( f(g_{3_1}, G) + f(g_{3_2}, G) + f(g_{3_3}, G) \right) \tag{4}$$

The complexity of counting all graphlets of size k = 3 is $O(|E| \cdot \Delta)$ asymptotically,
as we show next in Lemma 1.

Lemma 1. Algorithm 1 counts all graphlets of size k = 3 nodes in $O(|E| \cdot \Delta)$.

Proof. For each edge e = (u, v) such that $e \in E$, the runtime complexity of counting
all triangle and 2-star patterns incident to e (i.e., $\mathrm{Tri}_e$ and $\mathrm{Star}_e$ respectively) is
$O(|\mathcal{N}(u)| + |\mathcal{N}(v)|)$, which is asymptotically $O(\Delta)$, where $\Delta$ is the maximum degree
in the graph. Further, all 3-node-1-edge patterns
of size k = 3 incident to e can be counted in constant time $O(1)$. Therefore,
the total runtime complexity for counting all graphlets of size k = 3 in the graph
is $O\left(\sum_{e \in E} (\Delta + O(1))\right) = O(|E| \cdot \Delta)$. □


4. Counting Graphlets of Size (k = 4) Nodes

An exhaustive search of the egonet of any edge to count all 4-node graphlets
independently has a cost that grows polynomially with $\Delta$, where $\Delta$ is the maximum degree in
G. Clearly, exhaustive search is computationally intensive for large graphs. On
the other hand, our approach is hierarchical and more efficient, as we show next.

For each edge e = (u, v), we start by finding triangles and 2-star patterns.
Our central principle is that any 4-node graphlet $g_{4_i}$ can be decomposed into
four 3-node graphlets (Gross et al., 2013), obtained by deleting one node from
$g_{4_i}$ each time. Thus, we jointly count all possible 4-node graphlets by leveraging
the knowledge obtained from finding 3-node graphlets and some combinatorial
arguments that describe the relationships between pairs of graphlets. We summarize
this procedure in the following steps:
• Step 1: For each edge e, find all neighborhood nodes forming triangle and
2-star patterns with e.

• Step 2: For each edge e, use the knowledge from step 1 to count only 4-cliques
and 4-cycles.

• Step 3: For each edge e, use the knowledge from step 1 and some combinatorial
arguments to compute unrestricted counts for all 4-node graphlets in constant
time.

• Step 4: Merge the counts from all edges in the graph, and use combinatorial
arguments involving unrestricted counts to obtain the counts of all other
graphlets.


Note that we refer to the unrestricted counts as the counts that can be computed
in constant time and using only the knowledge obtained from step 1.
Next, we discuss the details of our approach. We start by discussing the graphlet
transition diagram to show the pairwise relationships between different 4-node
graphlets. Then, we discuss a general principle for counting 4-node graphlets,
which leverages the graphlet transition diagram and some combinatorial arguments
to improve the performance of graphlet counting.


4.1. Graphlet Transition Diagram 

Assume that each graphlet is a state. Fig. 1 shows all possible ±1 edge transitions 
between the states of all 4-node graphlets. We can transition from one graphlet 
to another by the deletion (denoted by dashed right arrows) or addition (denoted 
by solid left arrows) of a single edge. We define six different classes of possible 
edge roles denoted by the colors from black to orange (see Table in the top-right 
corner in Fig. 1). An edge role is an edge-level connectivity pattern (e.g., a chord 
edge), where two edges belong to the same role (i.e., class) if they are similar in 
their topological features. For each edge, we define a topological feature vector 
that consists of the number of triangles and 2-stars incident to this edge. Then, 
we classify edges to one of the six roles based on their feature vectors. Thus, 
all edges that appear in 4-node graphlets are colored by their roles. In addition,
the transition arrows are colored to match the edge roles, denoting which edge
type should be deleted or added to transition from one graphlet to another. Note
that a single edge deletion or addition changes the role (class) of other edges in
the graphlet. The table in the top-left corner of Fig. 1 shows the number of edge
roles per graphlet.

For example, consider the 4-clique graphlet ($g_{4_1}$), where each edge participates
in exactly two triangles. Therefore, all the edges in a 4-clique graphlet
($g_{4_1}$) belong to the first role (denoted by the black color). Similarly, consider
the 4-chordalcycle ($g_{4_2}$), where each edge (except the chord edge) participates
in exactly one triangle and one 2-star. Therefore, all edges in a 4-chordalcycle
($g_{4_2}$) belong to the second role (denoted by the blue color) except for the chord
edge, which belongs to the first role (denoted by the black color). Fig. 1 shows
how to transition from the 4-clique to the 4-chordalcycle ($g_{4_2}$) by deleting one
(any) edge from the 4-clique.
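
The feature vectors behind these two examples can be reproduced with a few lines of Python. This is only an illustrative sketch: the function name `edge_features` and the node labels are hypothetical, and the features are computed inside the 4-node graphlet itself, as in the discussion above.

```python
def edge_features(edges):
    """Map each edge of a small graphlet to (triangles, 2-stars) incident to it."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    features = {}
    for u, v in edges:
        triangles = len(adj[u] & adj[v])            # nodes adjacent to both endpoints
        stars = len((adj[u] ^ adj[v]) - {u, v})     # nodes adjacent to exactly one endpoint
        features[(u, v)] = (triangles, stars)
    return features

CLIQUE = [(0, 1), (0, 2), (0, 3), (1, 2), (1, 3), (2, 3)]
CHORDAL_CYCLE = [(0, 1), (1, 2), (2, 3), (3, 0), (0, 2)]   # (0, 2) is the chord
```

Every edge of the 4-clique gets the feature vector (2, 0). In the chordal cycle, the chord shares that vector, while the four cycle edges get (1, 1), which is why the chord keeps the first role and the other edges move to the second.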
Fig. 1. (4-node) graphlet transition diagram: The figure shows all possible ±1 edge transitions
between the set of all 4-node graphlets. Dashed right arrows denote the deletion of one
edge to transition from one graphlet to another. Solid left arrows denote the addition of one
edge to transition from one graphlet to another. Edges are colored by their feature-based roles,
where the set of features is defined by the number of triangles and 2-stars incident to an edge
(see Table in the top-right corner). We define six different classes of edge roles colored from
black to orange (see Table in the top-right corner). Dashed/solid arrows are colored to match
the edge roles to denote which edge would be deleted/added to transition from one graphlet
to another. The table in the top-left corner shows the number of edge roles per graphlet.


4.2. General Principle for Counting Graphlets of Size k = 4

Generally speaking, suppose we have $N^{(e)}$ distinct 4-node subgraphs that contain
an edge e = (u, v),

$$N^{(e)} = \left| \left\{ \{u, v, w, r\} \mid w, r \in V \setminus \{u, v\} \wedge w \neq r \right\} \right| \tag{5}$$

Each subgraph {u, v, w, r} in this collection may satisfy one or two properties
$a_i, a_j \in \mathcal{A} = \{T, S_u, S_v, I\}$. These properties describe the topological properties
of nodes w and r with respect to edge e, such that $A_w = a_i$ if {u, v, w} forms
subgraph pattern $a_i$, and $A_r = a_j$ if {u, v, r} forms subgraph pattern $a_j$. For
example, $A_w = T$ if w forms a triangle with e, and $A_w = S_u$ or $S_v$ if w forms a
2-star with e centered around u or v respectively. Also, $A_w = I$ if w is independent
(disconnected) from e. We clarify these properties by example in Fig. 2.
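
The four properties can be assigned with a simple membership test per node. The sketch below is an illustration only: the function name `node_properties` is hypothetical, and the toy graph is an assumed example loosely modeled on the caption of Fig. 2 (two triangle nodes, one 2-star node on each side, and two independent nodes).

```python
def node_properties(adj, u, v):
    """Label every node w outside e = (u, v) with its property T, Su, Sv, or I."""
    labels = {}
    for w in adj:
        if w in (u, v):
            continue
        near_u, near_v = w in adj[u], w in adj[v]
        if near_u and near_v:
            labels[w] = "T"       # w closes a triangle with e
        elif near_u:
            labels[w] = "Su"      # 2-star centered at u
        elif near_v:
            labels[w] = "Sv"      # 2-star centered at v
        else:
            labels[w] = "I"       # w is not adjacent to either endpoint
    return labels

# Hypothetical toy graph (an assumption for illustration, not the exact graph of Fig. 2).
ADJ = {
    "u": {"v", "v1", "v2", "v3"},
    "v": {"u", "v2", "v3", "v4"},
    "v1": {"u"},
    "v2": {"u", "v"},
    "v3": {"u", "v"},
    "v4": {"v"},
    "v5": {"v6"},
    "v6": {"v5"},
}
```

Each node outside the edge receives exactly one label, so the four property sets partition the remaining nodes.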

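Let $N^{(e)}_{a_i,a_j}$ denote the number of 4-node subgraphs having properties $a_i, a_j \in \mathcal{A}$,

$$N^{(e)}_{a_i,a_j} = \left| \left\{ \{u, v, w, r\} \mid w, r \in V \setminus \{u, v\} \wedge w \neq r \wedge A_w = a_i, A_r = a_j \right\} \right| \tag{6}$$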


Fig. 2. Let T denote the nodes forming triangles with edge (u, v) (i.e., $v_2, v_3$), whereas $S_u$
and $S_v$ denote the nodes forming 2-stars centered at u and v respectively (i.e., $v_1, v_4$), and
let I denote the nodes that are not connected to edge e (i.e., $v_5, v_6$). Further, the dotted lines
represent edges incident to these nodes.

Now that we have defined the topological properties of nodes w and r relative
to edge e, we need to define whether nodes w and r are connected themselves.
Let $e'_{wr}$ represent whether w and r are connected or not, such that $e'_{wr} = 1$
if $(w, r) \in E$ and $e'_{wr} = 0$ otherwise. Accordingly, let $N^{(e)}_{a_i,a_j,e'_{wr}}$ denote the
number of 4-node graphlets {u, v, w, r}, where w, r satisfy property $a_i, a_j \in \mathcal{A}$
and $e'_{wr} \in \{0, 1\}$,

$$N^{(e)}_{a_i,a_j,e'_{wr}} = \left| \left\{ \{u, v, w, r\} \mid w, r \in V \setminus \{u, v\} \wedge w \neq r \wedge A_w = a_i, A_r = a_j \wedge e'_{wr} \in \{0, 1\} \right\} \right| \tag{7}$$


For example, $N^{(e)}_{T,T,1}$ is the number of all graphlets {u, v, w, r} containing
edge e, where both w and r form triangles with e and there exists an edge
between w and r. Using Equations (6) and (7), we provide a general principle
for graphlet counting in the following theorem.


Theorem 1. General Principle for Graphlet Counting: Given a graph G, for
any edge e = (u, v) in G, and for any properties $a_i, a_j \in \mathcal{A}$, the number of 4-node
graphlets {u, v, w, r} satisfies the following rule,

$$N^{(e)}_{a_i,a_j,0} = N^{(e)}_{a_i,a_j} - N^{(e)}_{a_i,a_j,1} \tag{8}$$

Proof. Suppose there is a subgraph {u, v, w, r} containing edge e, where nodes w
and r satisfy the $a_i, a_j$ properties respectively, and $(w, r) \in E$. Then the expression
on the right side counts this subgraph once in the $N^{(e)}_{a_i,a_j}$ term, and once in
the $N^{(e)}_{a_i,a_j,1}$ term. By the principle of inclusion-exclusion (Stanley, 1986), the total
contribution of the subgraph {u, v, w, r} in $N^{(e)}_{a_i,a_j,0}$ is zero. Thus, $N^{(e)}_{a_i,a_j,0}$ is the
number of graphlets having properties $a_i, a_j$, but $(w, r) \notin E$. □


Clearly, it is sufficient to compute $N^{(e)}_{a_i,a_j}$ and $N^{(e)}_{a_i,a_j,1}$ only, and use Theorem 1
to compute $N^{(e)}_{a_i,a_j,0}$ in constant time. Note that $N^{(e)}_{a_i,a_j}$ is an unrestricted count
and can be computed in constant time using the knowledge we have from finding
3-node graphlets.
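
Theorem 1 can be sanity-checked by brute force on a small graph. The sketch below is an illustration with hypothetical names (`label`, `pair_counts`) and an assumed toy graph; it enumerates unordered node pairs {w, r} explicitly, which the proposed approach avoids for the unrestricted count.

```python
from itertools import combinations

def label(adj, u, v, w):
    """Property of node w relative to e = (u, v): T, Su, Sv, or I."""
    near_u, near_v = w in adj[u], w in adj[v]
    if near_u and near_v:
        return "T"
    return "Su" if near_u else ("Sv" if near_v else "I")

def pair_counts(adj, u, v, ai, aj):
    """Brute-force (N, N_1, N_0): all pairs, connected pairs, disconnected pairs."""
    others = [w for w in adj if w not in (u, v)]
    total = connected = disconnected = 0
    for w, r in combinations(others, 2):
        props = (label(adj, u, v, w), label(adj, u, v, r))
        if props not in ((ai, aj), (aj, ai)):
            continue
        total += 1
        if r in adj[w]:
            connected += 1
        else:
            disconnected += 1
    return total, connected, disconnected

# Hypothetical toy graph: edge (0, 1) with T = {2, 3}, Su = {4}, Sv = {5}, I = {6}.
ADJ = {
    0: {1, 2, 3, 4},
    1: {0, 2, 3, 5},
    2: {0, 1, 3},
    3: {0, 1, 2},
    4: {0},
    5: {1},
    6: set(),
}
```

For the edge (0, 1), the pair of triangle nodes {2, 3} is connected, so it contributes to the count with $e'_{wr} = 1$ and to the unrestricted count, and the difference of the two gives zero disconnected pairs, as the theorem states.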

To simplify the discussion in the following sections, we precisely show how to 
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compute -W’iA , the number of 4-node graphlets {u, v, w, r} such that w, r satisfy 
property a^, Uj G A respectively. Let Wai be the set of nodes with property ai G A 
(i.e., Wai = {w € \ {m, u} I Ayj = G ^}), and similarly TZa^ be the set 

of nodes with property aj G A (i.e., TZa^ = {r G V \ {m, w} | Ar = aj,\/aj G ^}). 
If Oi = Gj, then Wai = - Thus, 

=^-(|Wa.|-l).|Wa.| (9) 

However, if ^ aj, then Wai 9'^'^ are mutually exclusive (i.e., Wai 

7^a, =0). 

Thus, we get the following, 

=|Wa.|.|7^aJ (10) 
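
As a minimal sketch of Eqs. (9) and (10) combined with Theorem 1, the per-edge counts could be computed as below. The function names `unrestricted_count` and `restricted_without_edge` are illustrative and do not come from the paper; the example values are hypothetical.

```python
from math import comb


def unrestricted_count(size_w: int, size_r: int, same_property: bool) -> int:
    """Unrestricted count N_{ai,aj} for one edge.

    size_w, size_r: sizes of the node sets W_{ai} and R_{aj}.
    same_property:  True when ai = aj (Eq. 9), False otherwise (Eq. 10).
    """
    if same_property:
        return comb(size_w, 2)      # unordered pairs drawn from a single set
    return size_w * size_r          # one node from each of two disjoint sets


def restricted_without_edge(unrestricted: int, with_edge: int) -> int:
    """Theorem 1: N_{ai,aj,0} = N_{ai,aj} - N_{ai,aj,1}."""
    return unrestricted - with_edge


# Hypothetical edge with three triangle nodes, two pairs of which are adjacent.
n_tt = unrestricted_count(3, 3, same_property=True)
n_tt_0 = restricted_without_edge(n_tt, with_edge=2)
```

Only the count with `e_wr = 1` requires inspecting the graph; the other two quantities follow from set sizes already known after the 3-node step.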


4.3. Analysis & Combinatorial Arguments 


In this section, we discuss combinatorial arguments involving unrestricted counts
that can be computed directly from our knowledge of 3-node graphlets.
These combinatorial arguments capture the relationships between the counts of
pairs of 4-node graphlets. The proofs of these relationships are based on Theorem 1
and the transition diagram in Fig. 1. For each pair of graphlets $g_{4_i}$ and
$g_{4_j}$, we show the relationship for each edge in the graph (in Corollaries 1-14), then
we show a generalization for the whole graph (in Lemmas 2-8).

4.3.1. Relationship between 4-Cliques & 4-ChordalCycles


Corollary 1. For any edge $e = (u,v)$ in the graph, the number of 4-cliques
containing $e$ is $N^{(e)}_{T,T,1}$.

Corollary 2. For any edge $e = (u,v)$ in the graph, the number of 4-chordalcycles,
where $e$ is the chord edge of the cycle (denoted by the black color in Fig. 1), is
$N^{(e)}_{T,T,0}$.

Lemma 2. For any graph $G$, the relationship between the counts of 4-cliques
(i.e., $f(g_{4_1},G)$) and 4-chordalcycles (i.e., $f(g_{4_2},G)$) is,

$$f(g_{4_2},G) = \sum_{e \in E} \binom{|\mathrm{Tri}_e|}{2} - 6\cdot f(g_{4_1},G)$$


Proof. From Theorem 1 and the addition principle (Stanley, 1986), the total
count for all edges in $G$ is,

$$\sum_{e \in E} N^{(e)}_{T,T,0} = \sum_{e \in E} N^{(e)}_{T,T} - \sum_{e \in E} N^{(e)}_{T,T,1} \tag{11}$$

Given that $N^{(e)}_{T,T}$ is the number of 4-node subgraphs $\{u,v,w,r\}$ containing $e$,
such that $\Lambda_w = T$, $\Lambda_r = T$. Thus, from Eq. (9), $N^{(e)}_{T,T} = \binom{|\mathrm{Tri}_e|}{2}$. From Corollary 1,
each 4-clique will be counted 6 times (once for each edge in the clique). Thus,
the total count of 4-cliques in $G$ is $f(g_{4_1},G) = \frac{1}{6}\sum_{e \in E} N^{(e)}_{T,T,1}$. Similarly, from
Corollary 2, each 4-chordalcycle is counted only once for each chord edge. Thus,
the total count of 4-chordalcycles in $G$ is $f(g_{4_2},G) = \sum_{e \in E} N^{(e)}_{T,T,0}$. By direct
substitution in Eq. (11), this lemma is true. □
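
Lemma 2 can be sanity-checked on toy graphs with a short sketch. This is not the paper's implementation: the 4-clique count here is a brute-force enumeration (suitable only for tiny inputs), and the function name and the adjacency-dictionary representation are assumptions made for illustration.

```python
from itertools import combinations
from math import comb


def chordal_cycles_via_lemma2(adj):
    """Return (4-cliques, 4-chordalcycles) for an undirected simple graph.

    adj maps each node to the set of its neighbours; labels must be orderable.
    """
    edges = [(u, v) for u in adj for v in adj[u] if u < v]
    # Sum over edges of C(|Tri_e|, 2); Tri_e is the set of common neighbours.
    tri_pairs = sum(comb(len(adj[u] & adj[v]), 2) for u, v in edges)
    # Brute-force 4-clique count over all 4-node subsets (toy graphs only).
    cliques = sum(
        1
        for quad in combinations(sorted(adj), 4)
        if all(b in adj[a] for a, b in combinations(quad, 2))
    )
    return cliques, tri_pairs - 6 * cliques


# K4 contains one 4-clique; the "diamond" (K4 minus one edge) is a chordal cycle.
k4 = {i: {j for j in range(4) if j != i} for i in range(4)}
diamond = {0: {1, 2, 3}, 1: {0, 2, 3}, 2: {0, 1}, 3: {0, 1}}
```

On `k4` every edge has two triangle nodes, so the sum equals 6 and the six per-edge views of the single clique cancel it exactly.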

4.3.2. Relationship between 4-Cycles & 4-Paths


Corollary 3. For any edge $e = (u,v)$ in the graph, the number of 4-cycles
containing $e$ is $N^{(e)}_{S_u,S_v,1}$.

Corollary 4. For any edge $e = (u,v)$ in the graph, the number of 4-paths
containing $e$, where $e$ is the middle edge in the path (denoted by the green color in
Fig. 1), is $N^{(e)}_{S_u,S_v,0}$.

Lemma 3. For any graph $G$, the relationship between the counts of 4-cycles
(i.e., $f(g_{4_4},G)$) and 4-paths (i.e., $f(g_{4_6},G)$) is,

$$f(g_{4_6},G) = \sum_{e \in E} |\mathrm{Star}_u|\cdot|\mathrm{Star}_v| - 4\cdot f(g_{4_4},G)$$


Proof. From Theorem 1 and the addition principle (Stanley, 1986), the total
count for all edges in $G$ is,

$$\sum_{e \in E} N^{(e)}_{S_u,S_v,0} = \sum_{e \in E} N^{(e)}_{S_u,S_v} - \sum_{e \in E} N^{(e)}_{S_u,S_v,1} \tag{12}$$

Given that $N^{(e)}_{S_u,S_v}$ is the number of 4-node subgraphs $\{u,v,w,r\}$ containing $e$,
such that $\Lambda_w = S_u$, $\Lambda_r = S_v$. Thus, from Eq. (10), $N^{(e)}_{S_u,S_v} = |\mathrm{Star}_u|\cdot|\mathrm{Star}_v|$.
From Corollary 3, each 4-cycle will be counted 4 times (once for each edge in
the cycle). Thus, the total count of 4-cycles in $G$ is $f(g_{4_4},G) = \frac{1}{4}\sum_{e \in E} N^{(e)}_{S_u,S_v,1}$.
Similarly, from Corollary 4, each 4-path is counted only once for each middle edge
in the path. Thus, the total count of 4-paths in $G$ is $f(g_{4_6},G) = \sum_{e \in E} N^{(e)}_{S_u,S_v,0}$.
By direct substitution in Eq. (12), this lemma is true. □
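
The relationship in Lemma 3 can be illustrated with a small sketch that gathers the two sums of Eq. (12) edge by edge. The function name and the adjacency-dictionary input are assumptions for illustration; the sets `star_u` and `star_v` follow the definitions used in the text (neighbours of one endpoint that are not adjacent to the other).

```python
def cycles_and_paths_via_lemma3(adj):
    """Return (4-cycles, 4-paths) for an undirected simple graph.

    adj maps each node to the set of its neighbours; labels must be orderable.
    """
    star_products = 0   # sum over edges of |Star_u| * |Star_v|
    closed_pairs = 0    # sum over edges of N_{Su,Sv,1}
    for u in adj:
        for v in adj[u]:
            if not u < v:
                continue                      # visit each undirected edge once
            star_u = adj[u] - adj[v] - {v}
            star_v = adj[v] - adj[u] - {u}
            star_products += len(star_u) * len(star_v)
            # Pairs (w, r) with w in Star_u, r in Star_v that are adjacent.
            closed_pairs += sum(len(adj[w] & star_v) for w in star_u)
    cycles = closed_pairs // 4                # each 4-cycle is seen from 4 edges
    paths = star_products - 4 * cycles        # Lemma 3
    return cycles, paths


path4 = {0: {1}, 1: {0, 2}, 2: {1, 3}, 3: {2}}
cycle4 = {0: {1, 3}, 1: {0, 2}, 2: {1, 3}, 3: {0, 2}}
```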


4.3.3. Relationship between 4-TailedTriangles & 4-ChordalCycles


Corollary 5. For any edge $e = (u,v)$ in the graph, the number of 4-tailedtriangles
where $e$ is part of both the triangle and 2-star patterns (denoted by the blue color
in Fig. 1), is $N^{(e)}_{T,S_u \vee S_v,0}$.

Corollary 6. For any edge $e = (u,v)$ in the graph, the number of 4-chordalcycles
where $e$ is a cycle edge (denoted by the blue color in Fig. 1), is $N^{(e)}_{T,S_u \vee S_v,1}$.

Lemma 4. For any graph $G$, the relationship between the counts of 4-chordalcycles
(i.e., $f(g_{4_2},G)$) and 4-tailedtriangles (i.e., $f(g_{4_3},G)$) is,

$$2\cdot f(g_{4_3},G) = \sum_{e \in E} |\mathrm{Tri}_e|\cdot(|\mathrm{Star}_u| + |\mathrm{Star}_v|) - 4\cdot f(g_{4_2},G)$$

Proof. From Theorem 1 and the addition principle (Stanley, 1986), the total
count for all edges in $G$ is,

$$\sum_{e \in E} N^{(e)}_{T,S_u \vee S_v,0} = \sum_{e \in E} N^{(e)}_{T,S_u \vee S_v} - \sum_{e \in E} N^{(e)}_{T,S_u \vee S_v,1} \tag{13}$$

Given that $N^{(e)}_{T,S_u \vee S_v}$ is the number of 4-node subgraphs
$\{u,v,w,r\}$ containing $e$, such that $\Lambda_w = T$ and $\Lambda_r = S_u$ or $\Lambda_r = S_v$. Thus, from Eq. (10),
$N^{(e)}_{T,S_u \vee S_v} = |\mathrm{Tri}_e|\cdot(|\mathrm{Star}_u| + |\mathrm{Star}_v|)$. Now, from Corollary 6, each 4-chordalcycle
is counted 4 times (once for each edge in the cycle). Thus, the total count of
4-chordalcycles in $G$ is $f(g_{4_2},G) = \frac{1}{4}\sum_{e \in E} N^{(e)}_{T,S_u \vee S_v,1}$. Similarly, from Corollary 5,
each 4-tailedtriangle will be counted 2 times (once for each blue edge as in Fig. 1).
Thus, the total count of 4-tailedtriangles in $G$ is $f(g_{4_3},G) = \frac{1}{2}\sum_{e \in E} N^{(e)}_{T,S_u \vee S_v,0}$.
By direct substitution in Eq. (13), this lemma is true. □


4.3.4. Relationship between 4-TailedTriangles & 3-Stars


Corollary 7. For any edge $e = (u,v)$ in the graph, the number of 4-tailedtriangles
with $e$ as the tail edge (denoted by the green color in Fig. 1) and $u$ part of the
triangle, is $N^{(e)}_{S_u,S_u,1}$.

In a similar fashion, the number of 4-tailedtriangles with $e$ as the tail edge and
$v$ part of the triangle is $N^{(e)}_{S_v,S_v,1}$. Thus, the total number of 4-tailedtriangles
with $e$ as the tail edge and $u \vee v$ part of the triangle is
$N^{(e)}_{S_\cdot,S_\cdot,1} = N^{(e)}_{S_u,S_u,1} + N^{(e)}_{S_v,S_v,1}$.

Corollary 8. For any edge $e = (u,v)$ in the graph, the number of 3-stars centered
around $u$ is $N^{(e)}_{S_u,S_u,0}$.

Again, the number of 3-stars centered around $v$ is $N^{(e)}_{S_v,S_v,0}$. Thus, the total
number of 3-stars centered around $u$ or $v$ is $N^{(e)}_{S_\cdot,S_\cdot,0} = N^{(e)}_{S_u,S_u,0} + N^{(e)}_{S_v,S_v,0}$.


Lemma 5. For any graph $G$, the relationship between the counts of 3-stars (i.e.,
$f(g_{4_5},G)$) and 4-tailedtriangles (i.e., $f(g_{4_3},G)$) is,

$$3\cdot f(g_{4_5},G) = \sum_{e \in E} \left[\binom{|\mathrm{Star}_u|}{2} + \binom{|\mathrm{Star}_v|}{2}\right] - f(g_{4_3},G)$$


Proof. From Theorem 1 and the addition principle (Stanley, 1986), the total
count for all edges in $G$ is,

$$\sum_{e \in E} N^{(e)}_{S_\cdot,S_\cdot,0} = \sum_{e \in E} N^{(e)}_{S_\cdot,S_\cdot} - \sum_{e \in E} N^{(e)}_{S_\cdot,S_\cdot,1} \tag{14}$$

Given that $N^{(e)}_{S_\cdot,S_\cdot} = N^{(e)}_{S_u,S_u} + N^{(e)}_{S_v,S_v}$ is the number of 4-node subgraphs
$\{u,v,w,r\}$ containing $e$, such that $\Lambda_w = S_u \wedge \Lambda_r = S_u$ or $\Lambda_w = S_v \wedge \Lambda_r = S_v$.
Thus, from Eq. (9), $N^{(e)}_{S_\cdot,S_\cdot} = \binom{|\mathrm{Star}_u|}{2} + \binom{|\mathrm{Star}_v|}{2}$. Now, from Corollary 8, each
3-star is counted 3 times (once for each edge in the star). Thus, the total count
of 3-stars in $G$ is $f(g_{4_5},G) = \frac{1}{3}\sum_{e \in E} N^{(e)}_{S_\cdot,S_\cdot,0}$. Similarly, from Corollary 7, each
4-tailedtriangle will be counted once for each tail edge (denoted by the green
color in Fig. 1). Thus, the total count of 4-tailedtriangles in $G$ is
$f(g_{4_3},G) = \sum_{e \in E} N^{(e)}_{S_\cdot,S_\cdot,1}$. This holds whether the patterns are centered around $u$ or $v$. By
direct substitution in Eq. (14), this lemma is true. □
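
A small sketch of Lemma 5 follows, assuming the count of 4-tailedtriangles is already known (here it is passed in as an argument). The function name and the two toy graphs are illustrative choices, not taken from the paper.

```python
from math import comb


def three_stars_via_lemma5(adj, tailed_triangles):
    """Return the number of 3-stars, given the number of 4-tailedtriangles.

    adj maps each node to the set of its neighbours; labels must be orderable.
    """
    star_pairs = 0      # sum over edges of C(|Star_u|, 2) + C(|Star_v|, 2)
    for u in adj:
        for v in adj[u]:
            if u < v:                         # visit each undirected edge once
                star_u = adj[u] - adj[v] - {v}
                star_v = adj[v] - adj[u] - {u}
                star_pairs += comb(len(star_u), 2) + comb(len(star_v), 2)
    # Lemma 5: 3 * f(3-stars) = star_pairs - f(tailed triangles)
    return (star_pairs - tailed_triangles) // 3


claw = {0: {1, 2, 3}, 1: {0}, 2: {0}, 3: {0}}            # a single 3-star
paw = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2}}        # a single tailed triangle
```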

4.3.5. Relationship between 4-TailedTriangles & 4-Node-1-Triangles

Corollary 9. For any edge $e = (u,v)$ in the graph, the number of 4-node-1-triangles
is $N^{(e)}_{T,I,0}$.

Corollary 10. For any edge $e = (u,v)$ in the graph, the number of 4-tailedtriangles
with $e$ participating in the triangle but not connected to the tail edge (denoted by
the red color in Fig. 1), is $N^{(e)}_{T,I,1}$.

Proof. Suppose there is a subgraph $\{u,v,w,r\}$ containing $e$. $\{u,v,w,r\}$ is a
4-tailedtriangle with $e$ participating in the triangle but not connected to the tail
edge, if and only if there are some nodes $w,r$ such that $w \in \mathrm{Tri}_e$, $r \notin \mathcal{N}(e)$,
and $(w,r) \in E$. This means $r$ is independent of $e$, and $w$ forms a triangle with
$e$. As such, $\Lambda_w = T$ and $\Lambda_r = I$ and $e_{wr} = 1$. More generally, any subgraph
$\{u,v,w,r\}$ containing $e$ contributes once in the count $N^{(e)}_{T,I,1}$ if and only if it is
a 4-tailedtriangle with $e$ participating in the triangle but not connected to the
tail edge. In Theorem 1, we showed that $N^{(e)}_{T,I,1} \le N^{(e)}_{T,I}$. □


Lemma 6. For any graph $G$, the relationship between the counts of 4-tailedtriangles
(i.e., $f(g_{4_3},G)$) and 4-node-1-triangles (i.e., $f(g_{4_7},G)$) is,

$$3\cdot f(g_{4_7},G) = \sum_{e \in E} \Big(|\mathrm{Tri}_e|\cdot\big(|V| - |\mathcal{N}(u) \cup \mathcal{N}(v)|\big)\Big) - f(g_{4_3},G)$$


Proof. From Theorem 1 and the addition principle (Stanley, 1986), the total
count for all edges in $G$ is,

$$\sum_{e \in E} N^{(e)}_{T,I,0} = \sum_{e \in E} N^{(e)}_{T,I} - \sum_{e \in E} N^{(e)}_{T,I,1} \tag{15}$$

Given that $N^{(e)}_{T,I}$ is the number of 4-node subgraphs $\{u,v,w,r\}$ containing
$e$, such that $\Lambda_w = T$, $\Lambda_r = I$. And, the number of nodes independent of $e$ is
$|V| - |\mathcal{N}(u) \cup \mathcal{N}(v)|$. Thus, from Eq. (10), $N^{(e)}_{T,I} = |\mathrm{Tri}_e|\cdot\big(|V| - |\mathcal{N}(u) \cup \mathcal{N}(v)|\big)$.
Now, from Corollary 10, each 4-tailedtriangle is counted one time (once for the
red edge as in Fig. 1). Thus, the total count of 4-tailedtriangles in $G$ is
$f(g_{4_3},G) = \sum_{e \in E} N^{(e)}_{T,I,1}$. Similarly, from Corollary 9, each 4-node-1-triangle will be counted 3
times (once for each edge in the triangle). Thus, the total count of 4-node-1-triangles
in $G$ is $f(g_{4_7},G) = \frac{1}{3}\sum_{e \in E} N^{(e)}_{T,I,0}$. By direct substitution in Eq. (15),
this lemma is true. □
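
Lemma 6 can be sketched in the same style. The code below assumes the 4-tailedtriangle count is supplied by the caller; the function name and the toy graphs are hypothetical.

```python
def one_triangle_graphlets_via_lemma6(adj, tailed_triangles):
    """Return the number of 4-node-1-triangle graphlets.

    adj maps each node to the set of its neighbours; labels must be orderable.
    """
    n = len(adj)
    total = 0           # sum over edges of |Tri_e| * (|V| - |N(u) U N(v)|)
    for u in adj:
        for v in adj[u]:
            if u < v:                          # visit each undirected edge once
                tri_e = adj[u] & adj[v]
                # N(u) U N(v) already contains u and v themselves.
                independent = n - len(adj[u] | adj[v])
                total += len(tri_e) * independent
    # Lemma 6: 3 * f(4-node-1-triangle) = total - f(tailed triangles)
    return (total - tailed_triangles) // 3


triangle_plus_isolated = {0: {1, 2}, 1: {0, 2}, 2: {0, 1}, 3: set()}
paw = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2}}
```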
4.3.6. Relationship between 4-Paths & 4-Node-2-Stars


Corollary 11. For any edge $e = (u,v)$ in the graph, the number of 4-paths
where $e$ is the start or end of the path (denoted by the purple color in Fig. 1), is
$N^{(e)}_{S_u \vee S_v,I,1}$.

Corollary 12. For any edge $e = (u,v)$ in the graph, the number of 4-node-2-stars
where $e$ is one of the star edges (denoted by the purple color in Fig. 1), is
$N^{(e)}_{S_u \vee S_v,I,0}$.

Lemma 7. For any graph $G$, the relationship between the counts of 4-paths (i.e.,
$f(g_{4_6},G)$) and 4-node-2-stars (i.e., $f(g_{4_8},G)$) is,

$$2\cdot f(g_{4_8},G) = \sum_{e \in E} |\mathrm{Star}_e|\cdot\big(|V| - |\mathcal{N}(u) \cup \mathcal{N}(v)|\big) - 2\cdot f(g_{4_6},G)$$


Proof. From Theorem 1 and the addition principle (Stanley, 1986), the total
count for all edges in $G$ is,

$$\sum_{e \in E} N^{(e)}_{S_u \vee S_v,I,0} = \sum_{e \in E} N^{(e)}_{S_u \vee S_v,I} - \sum_{e \in E} N^{(e)}_{S_u \vee S_v,I,1} \tag{16}$$

Given that $N^{(e)}_{S_u \vee S_v,I}$ is the number of 4-node subgraphs
$\{u,v,w,r\}$ containing $e$, such that $\Lambda_w = S_u \vee S_v$, $\Lambda_r = I$. And, the number of
nodes independent of $e$ is $|V| - |\mathcal{N}(u) \cup \mathcal{N}(v)|$. Thus, from Eq. (10),
$N^{(e)}_{S_u \vee S_v,I} = |\mathrm{Star}_e|\cdot\big(|V| - |\mathcal{N}(u) \cup \mathcal{N}(v)|\big)$, such that $|\mathrm{Star}_e| = |\mathrm{Star}_u| + |\mathrm{Star}_v|$. Now, from
Corollary 11, each 4-path is counted 2 times (for both the start and end edges
in the path, denoted by the purple in Fig. 1). Thus, the total count of 4-paths
in $G$ is $f(g_{4_6},G) = \frac{1}{2}\sum_{e \in E} N^{(e)}_{S_u \vee S_v,I,1}$. Similarly, from Corollary 12, each
4-node-2-star will be counted 2 times (once for each edge in the star, denoted by the
purple in Fig. 1). Thus, the total count of 4-node-2-stars in $G$ is
$f(g_{4_8},G) = \frac{1}{2}\sum_{e \in E} N^{(e)}_{S_u \vee S_v,I,0}$. By direct substitution in Eq. (16), this lemma is true. □


4.3.7. Relationship between 4-Node-2-Edges & 4-Node-1-Edge

Corollary 13. For any edge $e = (u,v)$ in the graph, the number of 4-node-2-edges
where $e$ is any of the two independent edges in the graphlet (denoted by the
orange color in Fig. 1), is $N^{(e)}_{I,I,1}$.

Corollary 14. For any edge $e = (u,v)$ in the graph, the number of 4-node-1-edge
graphlets where $e$ is an isolated/single edge in the graphlet (denoted by the orange color in
Fig. 1), is $N^{(e)}_{I,I,0}$.

Lemma 8. For any graph $G$, the relationship between the counts of 4-node-2-edge
graphlets (i.e., $f(g_{4_9},G)$) and 4-node-1-edge graphlets (i.e., $f(g_{4_{10}},G)$) is,

$$f(g_{4_{10}},G) = \sum_{e \in E} \binom{|V| - |\mathcal{N}(u) \cup \mathcal{N}(v)|}{2} - 2\cdot f(g_{4_9},G)$$





Proof. From Theorem 1 and the addition principle (Stanley, 1986), the total
count for all edges in $G$ is,

$$\sum_{e \in E} N^{(e)}_{I,I,0} = \sum_{e \in E} N^{(e)}_{I,I} - \sum_{e \in E} N^{(e)}_{I,I,1} \tag{17}$$

Given that $N^{(e)}_{I,I}$ is the number of 4-node subgraphs $\{u,v,w,r\}$ containing
$e$, such that $\Lambda_w = I$, $\Lambda_r = I$. And, the number of nodes independent of $e$ is
$|V| - |\mathcal{N}(u) \cup \mathcal{N}(v)|$. Thus, from Eq. (9), $N^{(e)}_{I,I} = \binom{|V| - |\mathcal{N}(u) \cup \mathcal{N}(v)|}{2}$. Now, from
Corollary 13, each 4-node-2-edge is counted 2 times (for the two edges in the
graphlet, denoted by the orange in Fig. 1). Thus, the total count of 4-node-2-edges
in $G$ is $f(g_{4_9},G) = \frac{1}{2}\sum_{e \in E} N^{(e)}_{I,I,1}$. Similarly, from Corollary 14, each
4-node-1-edge will be counted once (for the isolated/single edge in the graphlet,
denoted by the orange in Fig. 1). Thus, the total count of 4-node-1-edge graphlets in
$G$ is $f(g_{4_{10}},G) = \sum_{e \in E} N^{(e)}_{I,I,0}$. By direct substitution in Eq. (17), this lemma is
true. □


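While it is straightforward to compute $N^{(e)}_{I,I}$ for each edge $e$, this is not the case
for $N^{(e)}_{I,I,1}$ or $N^{(e)}_{I,I,0}$, as they require searching outside the local edge neighborhood.
However, since $N^{(e)}_{I,I,1}$ is the number of edges outside the egonet of $e$, it can be
computed as,

$$\begin{aligned}
N^{(e)}_{I,I,1} = {} & |E| - |\mathcal{N}(u) \setminus \{v\}| - |\mathcal{N}(v) \setminus \{u\}| - |\{e\}| \\
& - \big[N^{(e)}_{T,T,1} + N^{(e)}_{T,S_u \vee S_v,1} + N^{(e)}_{T,I,1}\big] \\
& - \big[N^{(e)}_{S_u,S_v,1} + N^{(e)}_{S_\cdot,S_\cdot,1} + N^{(e)}_{S_u \vee S_v,I,1}\big]
\end{aligned}$$

Thus, the total number of 4-node-2-edges is,

$$\begin{aligned}
2\cdot f(g_{4_9},G) = {} & \sum_{e \in E} N^{(e)}_{I,I,1} \\
= {} & \sum_{e \in E} \big(|E| - |\mathcal{N}(u) \setminus \{v\}| - |\mathcal{N}(v) \setminus \{u\}| - |\{e\}|\big) \\
& - \big[6\cdot f(g_{4_1},G) + 4\cdot f(g_{4_2},G) + 2\cdot f(g_{4_3},G)\big] \\
& - \big[4\cdot f(g_{4_4},G) + 2\cdot f(g_{4_6},G)\big]
\end{aligned} \tag{18}$$

Finally, the number of 4-node-independent graphlets ($g_{4_{11}}$) is,

$$f(g_{4_{11}},G) = \binom{|V|}{4} - \sum_{i=1}^{10} f(g_{4_i},G) \tag{19}$$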
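
The last step is plain arithmetic: every 4-node subset of $V$ induces exactly one of the eleven graphlets, so the independent graphlet count is what remains after subtracting the other ten. A minimal sketch, with an illustrative function name and a toy input that are not from the paper:

```python
from math import comb


def independent_graphlets(num_nodes, other_counts):
    """All 4-node subsets minus the counts of the ten other 4-node graphlets.

    other_counts: frequencies of g_{4_1} .. g_{4_10}, in that order.
    """
    if len(other_counts) != 10:
        raise ValueError("expected counts for g_{4_1} .. g_{4_10}")
    return comb(num_nodes, 4) - sum(other_counts)


# Toy graph: 5 nodes and a single edge. Of the C(5,4) = 5 subsets, the 3 that
# contain both endpoints are 4-node-1-edge graphlets (g_{4_10}).
toy_counts = [0] * 9 + [3]
remaining = independent_graphlets(5, toy_counts)
```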


4.4. Algorithm 

Algorithm 2 (GraphletCounting) shows how to count all graphlets of size k =
{3,4} nodes efficiently (using Lemmas 2-8). As discussed previously, we start
by finding all triangle and 2-star patterns in Lines 7-15 (i.e., Step 1). Then, in
Lines 18-19 we only count 4-cliques and 4-cycles (i.e., Step 2). Then, Lines 21-32
compute unrestricted counts for all 4-node graphlets in constant time (using
knowledge from Steps 1 and 2, i.e., Step 3), and finally Lines 35-37 compute
the final counts (using the lemmas proved in Section 4.3) (i.e., Step 4). Our
approach counts all 4-cliques and 4-cycles in $O(m\cdot\Delta\cdot T_{max})$ and $O(m\cdot\Delta\cdot S_{max})$
respectively, where $m = |E|$, $T_{max}$ is the maximum number of triangles incident to an edge
and $T_{max} \ll \Delta$ for sparse graphs, and $S_{max}$ is the maximum number of stars
incident to an edge and $S_{max} \le \Delta$, as we show in Lemmas 9 and 10. This is more
efficient than $O(|V|\cdot\Delta^3)$ given by (Shervashidze et al., 2009), and $O(\Delta\cdot|E| + |E|^2)$
given by (Marcus and Shavitt, 2012).

Lemma 9. Alg. 2 counts all 4-cliques in $O(|E|\cdot\Delta\cdot T_{max})$, where $T_{max}$ is the
maximum number of triangles incident to an edge.

Proof. For each edge $e = (u,v) \in E$, the runtime complexity of counting all
4-cliques incident to $e$ is equivalent to finding the set of all edges $e' = (w,w')$ such
that $\{e' = (w,w') \in E \mid w,w' \in \mathrm{Tri}_e \wedge w \neq w'\}$, where $\mathrm{Tri}_e$ is the set of triangles
incident to $e$. First, we show in Lem. 1 that the runtime complexity of finding
all triangles incident to $e$ is $O(\Delta)$. Second, as described in Alg. 2, the runtime
complexity of checking whether any two distinct nodes $w,w' \in \mathrm{Tri}_e$ are connected
by an edge $e' = (w,w')$ is $O\big(\sum_{w \in \mathrm{Tri}_e} \Delta\big) = O(|\mathrm{Tri}_e|\cdot\Delta)$, and can be computed
asymptotically in $O(T_{max}\cdot\Delta)$, where $T_{max}$ is the maximum triangle degree (i.e., the
maximum number of triangles incident to an edge and $T_{max} \ll \Delta$). Therefore,
the total runtime complexity is $O\big(\sum_{e \in E} (\Delta + T_{max}\cdot\Delta)\big) = O(|E|\cdot\Delta\cdot T_{max})$. □

Lemma 10. Alg. 2 counts all 4-cycles in $O(|E|\cdot\Delta\cdot S_{max})$, where
$S_{max}$ is the maximum number of 2-stars incident to an edge (proof is similar to
Lem. 9).

Proof. For each edge $e = (u,v) \in E$, the runtime complexity of counting all
4-cycles incident to $e$ is equivalent to finding the set of all edges $e' = (w,w')$
such that $\{e' = (w,w') \in E \mid w \in \mathrm{Star}_u \wedge w' \in \mathrm{Star}_v \wedge w \neq w'\}$. First, we show in
Lem. 1 that the runtime complexity of finding all 2-star patterns incident to $e$ is
$O(\Delta)$. Second, Alg. 2 shows the runtime complexity of checking whether any two
distinct nodes $w \in \mathrm{Star}_u$ and $w' \in \mathrm{Star}_v$ are connected by an edge $e' = (w,w')$ is
$O\big(\sum_{w \in \mathrm{Star}_u} \Delta\big) = O(|\mathrm{Star}_u|\cdot\Delta)$, and is asymptotically $O(S_{max}\cdot\Delta)$ (where $S_{max}$ is
the maximum number of 2-stars incident to an edge, and $S_{max} \le \Delta$). Therefore,
the total runtime complexity is $O\big(\sum_{e \in E} (\Delta + S_{max}\cdot\Delta)\big) = O(|E|\cdot\Delta\cdot S_{max})$. □


5. Experiments 

We proceed by first demonstrating how fast our algorithm (Algorithm 2) counts 
all graphlets of size k = {3,4} (both connected and disconnected graphlets) on 
various networks. We make all our implementations, further experiments, and 
proofs available in an online appendix^. In this paper, we show detailed results for 


^ http://nesreenahmed.com/graphlets 
Algorithm 2 Our exact graphlet census algorithm for counting all 3,4-node graphlets.
The algorithm takes an undirected graph as input and returns the frequencies of all 3,4-node
graphlets.

 1  procedure GraphletCounting($G = (V, E)$)
 2      Initialize Array $X$
 3      $N_{T,T} = 0$, $N_{S_u,S_v} = 0$, $N_{T,S_u \vee S_v} = 0$, $N_{S_\cdot,S_\cdot} = 0$
 4      $N_{T,I} = 0$, $N_{S_u \vee S_v,I} = 0$, $N_{I,I} = 0$, $N_{I,I,1} = 0$
 5      parallel for $e = (u,v) \in E$ do
 6          $\mathrm{Star}_u = \emptyset$, $\mathrm{Star}_v = \emptyset$, $\mathrm{Tri}_e = \emptyset$
 7          for $w \in \mathcal{N}(u)$ do
 8              if $w = v$ then continue
 9              Add $w$ to $\mathrm{Star}_u$ and set $X(w) = 1$
10          for $w \in \mathcal{N}(v)$ do
11              if $w = u$ then continue
12              if $X(w) = 1$ then                              ▷ found triangle
13                  Add $w$ to $\mathrm{Tri}_e$ and set $X(w) = 2$
14                  Remove $w$ from $\mathrm{Star}_u$
15              else Add $w$ to $\mathrm{Star}_v$ and set $X(w) = 3$
16          Compute $f(g_3,G)$ as in Lines 14-16 of Alg. 1
17          // Get Counts of 4-Cliques & 4-Cycles
18          $f(g_{4_1},G)$ += CliqueCount($X$, $\mathrm{Tri}_e$)
19          $f(g_{4_4},G)$ += CycleCount($X$, $\mathrm{Star}_u$)
20          // Get Unrestricted Counts for 4-Node Connected Graphlets
21          $N_{T,T}$ += $\binom{|\mathrm{Tri}_e|}{2}$
22          $N_{S_u,S_v}$ += $|\mathrm{Star}_u|\cdot|\mathrm{Star}_v|$
23          $N_{T,S_u \vee S_v}$ += $|\mathrm{Tri}_e|\cdot(|\mathrm{Star}_u| + |\mathrm{Star}_v|)$
24
25          $N_{S_\cdot,S_\cdot}$ += $\binom{|\mathrm{Star}_u|}{2} + \binom{|\mathrm{Star}_v|}{2}$
26          // Get Unrestricted Counts for 4-Node Disconnected Graphlets
27          $N_{T,I}$ += $|\mathrm{Tri}_e|\cdot(|V| - |\mathcal{N}(u) \cup \mathcal{N}(v)|)$
28          $N_{S_u,I}$ = $|\mathrm{Star}_u|\cdot(|V| - |\mathcal{N}(u) \cup \mathcal{N}(v)|)$
29          $N_{S_v,I}$ = $|\mathrm{Star}_v|\cdot(|V| - |\mathcal{N}(u) \cup \mathcal{N}(v)|)$
30          $N_{S_u \vee S_v,I}$ += $N_{S_u,I} + N_{S_v,I}$
31          $N_{I,I}$ += $\binom{|V| - |\mathcal{N}(u) \cup \mathcal{N}(v)|}{2}$
32          $N_{I,I,1}$ += $|E| - |\mathcal{N}(u) \setminus \{v\}| - |\mathcal{N}(v) \setminus \{u\}| - 1$
33          for $w \in \mathcal{N}(v)$ do $X(w) = 0$
34      end parallel
35      Use Lemmas 2-7 to compute $f(g_{4_i},G)$ for $i = 1:8$
36      Use Eq. (18) to compute $f(g_{4_9},G)$ and Lemma 8 for $f(g_{4_{10}},G)$
37      Use Eq. (19) to compute $f(g_{4_{11}},G)$
38      return $f(g_3,G)$, $f(g_4,G)$

39  procedure CliqueCount($X$, $\mathrm{Tri}_e$)
40      $\mathrm{cliq}_e = 0$
41      for each node $w \in \mathrm{Tri}_e$ do
42          for $r \in \mathcal{N}(w)$ do
43              if $X(r) = 2$ then $\mathrm{cliq}_e$ += 1                    ▷ found 4-clique
44          $X(w) = 0$
45      return $\mathrm{cliq}_e$

46  procedure CycleCount($X$, $\mathrm{Star}_u$)
47      $\mathrm{cyc}_e = 0$
48      for each node $w \in \mathrm{Star}_u$ do
49          for $r \in \mathcal{N}(w)$ do
50              if $X(r) = 3$ then $\mathrm{cyc}_e$ += 1                     ▷ found 4-cycle
51          $X(w) = 0$
52      return $\mathrm{cyc}_e$
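
Steps 1 and 2 of Algorithm 2 can be sketched sequentially as follows. This is an illustrative, single-threaded reading of the pseudocode rather than the authors' implementation: the function name, the adjacency-dictionary input, and the use of a Python dictionary for the marker array `X` are assumptions, and the final division by 6 and 4 applies the per-edge multiplicities stated in Lemmas 2 and 3.

```python
def count_cliques_and_cycles(adj):
    """Return (#4-cliques, #4-cycles) of an undirected simple graph.

    adj maps each node to the set of its neighbours; labels must be orderable.
    """
    X = dict.fromkeys(adj, 0)       # marker array: 1 = Star_u, 2 = Tri_e, 3 = Star_v
    clique_sum = cycle_sum = 0
    for u in adj:
        for v in adj[u]:
            if not u < v:
                continue            # visit each undirected edge once
            star_u, tri_e = set(), set()
            for w in adj[u]:        # Step 1: mark the neighbours of u
                if w != v:
                    star_u.add(w)
                    X[w] = 1
            for w in adj[v]:        # ... then classify the neighbours of v
                if w == u:
                    continue
                if X[w] == 1:       # adjacent to both endpoints: triangle
                    tri_e.add(w)
                    X[w] = 2
                    star_u.discard(w)
                else:
                    X[w] = 3        # adjacent to v only: member of Star_v
            for w in tri_e:         # Step 2a: CliqueCount
                clique_sum += sum(1 for r in adj[w] if X[r] == 2)
                X[w] = 0            # avoids counting the pair (w, r) twice
            for w in star_u:        # Step 2b: CycleCount
                cycle_sum += sum(1 for r in adj[w] if X[r] == 3)
                X[w] = 0
            for w in adj[v]:        # reset the remaining markers
                X[w] = 0
    return clique_sum // 6, cycle_sum // 4


k4 = {i: {j for j in range(4) if j != i} for i in range(4)}
cycle4 = {0: {1, 3}, 1: {0, 2}, 2: {1, 3}, 3: {0, 2}}
```

A parallel version would need one marker array per worker, since `X` is mutated while an edge is being processed.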
60 networks categorized in 8 broad classes from social, Facebook (Traud, Mucha
and Porter, 2012), biological, web, technological, co-authorship, infrastructure,
among other domains (Rossi and Ahmed, 2015a) (see the links^ for data download).
In the online appendix, we present a more extensive collection of
300+ networks, including both large sparse networks as well as dense networks
from the DIMACS challenge^. Note that for all of the networks, we discard edge
weights, self-loops, and edge direction. To the best of our knowledge, this is the
largest study of graphlet counting, and these are the largest graphlet computations
published to date. Our own implementation of Algorithm 2 uses shared
memory, but the algorithm is well-suited for other architectures.


5.1. Efficiency & Runtime 

Table 2 describes the properties of the 60 networks considered here. It also shows
the counts of graphlets of size k = {3,4} and states the time (in seconds) taken to
count all graphlets. We only show counts of connected graphlets due to space
limitations; however, all counts are available in the online appendix. Notably,
Algorithm 2 takes only a few seconds to count all graphlets for large social, web,
and technological graphs (among others). For example, for a large road network
(i.e., inf-road-usa) with 24M nodes and 29M edges, Algorithm 2 takes only 4
seconds to count all graphlets. Also, as shown in Table 2, for large Facebook
networks with nearly 2M edges, Algorithm 2 takes only 15 seconds, and for large
web graphs with nearly 8M edges, Algorithm 2 takes only 25 seconds.

We compare the empirical runtime of Algorithm 2 to the state-of-the-art
baseline method RAGE (Marcus and Shavitt, 2012). For social and Facebook
networks, we observed that Algorithm 2 is on average 460x faster than RAGE.
For all other networks, we observed that Algorithm 2 is on average 600x faster
than RAGE. Notably, Algorithm 2 takes only 7 seconds to count graphlets of
Facebook networks with 1.3M edges, while RAGE takes almost an hour for the
same networks. For larger networks with millions of nodes/edges, RAGE
timed out (i.e., it did not finish within 30 hours of runtime). Moreover, for dense
graphs from the DIMACS challenge, RAGE takes almost 17 minutes, while
Algorithm 2 takes less than a second. We also compared against the baseline methods
FANMOD (Wernicke and Rasche, 2006) and Orca (Hocevar and Demsar, 2014).
For a Facebook network with 250k edges, FANMOD takes roughly
2.5 hours to count all graphlets, RAGE takes almost 7 minutes for the same
network, and Orca takes almost 10 seconds, while Algorithm 2 takes less than a
second. Note that both RAGE and Orca count only connected graphlets, while
our algorithm and FANMOD count both connected and disconnected graphlets.
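
As a worked example of how a per-network speedup follows from the reported runtimes, the sketch below divides the RAGE time by the Algorithm 2 time for three rows copied from Table 2. The dictionary name and the choice of rows are illustrative; the averages quoted in the text are the authors' figures over their full collections and are not recomputed here.

```python
# (Alg. 2 seconds, RAGE seconds) for three networks, as listed in Table 2.
runtimes = {
    "soc-brightkite": (0.2, 273.03),
    "socfb-Berkeley13": (4.94, 2514.59),
    "ca-dblp-2012": (0.48, 227.79),
}

# Speedup = baseline runtime / runtime of Algorithm 2.
speedups = {name: rage / alg2 for name, (alg2, rage) in runtimes.items()}

for name, factor in sorted(speedups.items()):
    print(f"{name}: about {factor:.0f}x faster than RAGE")
```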

In Figure 3, we plot the runtime of Algorithm 2 for a representative subset
of 150 social and information networks. The figure shows that our algorithm
exhibits nearly linear-time scaling over networks ranging from 1K to 100M nodes.


^ http://networkrepository.com/

^ http://dimacs.rutgers.edu/Challenges/
Table 2. Runtime & Statistics for a Subset of 60 Networks. The numbers are appended by K
for thousands, M for millions, B for billions, T for trillions, and P for quadrillions.
The last two columns give the runtime in seconds.

graph                     |V|       |E|       |g_{3_1}| |g_{3_2}| |g_{4_1}| |g_{4_2}| |g_{4_4}| |g_{4_6}| |g_{4_5}| |g_{4_3}| Alg.2     RAGE
soc-brightkite            57k       213k      494k      12M       2.9M      12M       2.7M      533M      1.3B      114M      0.2       273.03
socfb-Berkeley13          23k       852k      5.4M      125M      27M       153M      87M       17B       25B       2.7B      4.94      2514.59
socfb-Wisconsin87         24k       836k      4.9M      107M      23M       121M      59M       12B       21B       1.0B      3.93      1450.31
socfb-FSU53               28k       1.0M      7.9M      130M      63M       242M      95M       16B       10B       2.9B      5.55      2192.94
socfb-MSU24               32k       1.1M      6.5M      139M      33M       183M      106M      16B       32B       2.6B      5.67      1904.09
socfb-Texas80             32k       1.2M      9.6M      160M      68M       316M      122M      21B       11B       3.9B      7.53      2967.01
socfb-Michigan23          30k       1.2M      8.3M      162M      49M       277M      146M      23B       13B       3.5B      7.57      2995.83
socfb-Indiana69           30k       1.3M      9.4M      181M      60M       269M      141M      25B       13B       3.8B      8.44      3212.10
socfb-UIllinois20         31k       1.3M      9.4M      172M      64M       273M      130M      23B       27B       3.8B      7.88      3088.77
socfb-UF21                35k       1.5M      12M       266M      98M       433M      186M      40B       150B      7.2B      14.49     N/A
soc-flickr                514k      3.2M      59M       963M      1.7B      14B       6.7B      244B      326B      90B       182.57    N/A
soc-orkut                 3.1M      117M      628M      44B       3.2B      48B       70B       19T       98T       1.5T      2694.55   N/A
soc-sinaweibo             58M       261M      212M      804B      662M      27B       259B      157T      8.48P     3.80T     33359.7   N/A
soc-friendster            65.6M     1.8B      4.17B     708.1B    8.96B     131.4B    307.5B    364.7T    247.3T    5.79T     N/A       N/A
bio-celegans              453       2.0k      3.3k      69k       3.0k      37k       4.5k      495k      2.9M      363k      <0.001    1.7
bio-diseasome             516       1.2k      1.4k      5.4k      1.4k      923       42        18k       27k       19k       <0.001    0.44
bio-dmela                 7.4k      26k       2.9k      572k      393       13k       107k      11M       9.2M      312k      0.01      2.47
bio-yeast-protein-inter   1.8k      2.2k      222       11k       41        198       140       31k       72k       2.6k      <0.001    0.53
bio-yeast                 1.5k      1.9k      206       11k       39        195       139       31k       72k       2.5k      <0.001    0.43
bio-human-gene2           14k       9.0M      4.9B      10B       2.3T      3.7T      90B       4.4T      5.3T      8.4T      8023.84   N/A
bio-mouse-gene            43k       14M       3.6B      15B       670B      2.1T      223B      9.0T      6.7T      7.7T      5515.6    N/A
ca-CSphd                  1.9k      1.7k      8         6.6k      0         5         8         9.4k      32k       93        <0.001    1.25
ca-GrQc                   4.2k      13k       48k       85k       329k      66k       1.1k      553k      406k      628k      <0.001    5.99
ca-dblp-2012              317k      1.0M      2.2M      15M       17M       4.8M      203k      252M      259M      97M       0.48      227.79
ca-cit-HepTh              23k       2.4M      191M      1.6B      13B       47B       7.3B      538B      976B      385B      132.66    N/A
ca-cit-HepPh              28k       3.1M      196M      1.5B      9.8B      34B       6.1B      536B      479B      276B      125.49    N/A
ca-coauthors-dblp         540k      15M       444M      698M      15B       3.4B      31M       42B       27B       67B       40.26     N/A
ca-hollywood-2009         1.1M      56M       4.9B      33B       1.4T      635B      168B      21T       17T       8.9T      13799.6   N/A
tech-as-caida2007         26k       53k       36k       15M       54k       1.7M      407k      285M      7.8B      47M       0.19      36.83
tech-p2p-gnutella         63k       148k      2.0k      1.6M      16        826       42k       15M       8.1M      71k       0.02      7.44
tech-RL-caida             191k      608k      455k      21M       423k      7.4M      40M       583M      1.7B      77M       0.39      71.74
tech-WHOIS                7.5k      57k       782k      5.3M      12M       31M       2.9M      229M      566M      194M      0.14      44.52
tech-as-skitter           1.7M      11M       29M       16B       149M      20B       43B       819B      96T       162B      476.06    N/A
web-BerkStan-dir          685k      6.6M      65M       28B       1.1B      99B       25B       49B       382T      476B      149.17    N/A
web-edu                   3.0k      6.5k      10k       81k       40k       4.6k      18        435k      1.3M      186k      <0.001    0.52
web-google-dir            876k      4.3M      13M       687M      40M       382M      38M       4.1B      650B      6.7B      4.45      N/A
web-indochina-2004        11k       48k       210k      481k      1.2M      88k       9.2k      5.5M      12M       4.9M      0.01      24.36
web-it-2004               509k      7.2M      339M      56M       29B       815M      175M      1.1B      1.4B      527M      25.26     N/A
web-baidu-baike           2.1M      17M       25M       31B       28M       4.5B      9.2B      3.3T      571T      327B      3975.81   N/A
web-wikipedia-growth      1.9M      37M       127M      123B      288M      38B       68B       29T       3.1P      3.2T      22389.2   N/A
web-ClueWeb09-50m         148M      447M      1.2B      494B      5.6B      243B      774B      34T       24P       3.4T      15665.9   N/A
inf-italy-osm             6.7M      7.0M      7.4k      8.2M      0         244       47k       9.9M      992k      27k       0.85      N/A
inf-openflights           2.9k      16k       73k       639k      286k      1.5M      319k      17M       17M       9.0M      0.01      2.46
inf-power                 4.9k      6.6k      651       17k       90        385       324       38k       20k       5.1k      <0.001    0.58
inf-roadNet-CA            2.0M      2.8M      120k      5.6M      40        13k       249k      11M       2.4M      521k      0.35      N/A
inf-roadNet-PA            1.1M      1.5M      67k       3.2M      16        5.7k      152k      6.2M      1.4M      295k      0.19      N/A
inf-road-usa              24M       29M       439k      50M       90        21k       1.6M      81M       18M       1.5M      4.05      N/A
ia-email-EU-dir           265k      364k      267k      194M      581k      10M       6.7M      4.4B      221B      341M      1.52      887.18
ia-enron-only             143       623       889       4.8k      779       2.7k      648       29k       17k       14k       <0.001    0.12
ia-reality                6.8k      7.7k      400       497k      63        1.7k      2.8k      1.6M      26M       93k       <0.001    1.39
ia-wiki-Talk-dir          2.4M      4.7M      9.2M      13B       65M       1.0B      924M      1.2T      192T      64B       281.33    N/A
ia-wikiquote-user-edits   93k       238k      279k      636M      411k      70M       44M       8.9B      2.4T      2.5B      2.41      691.28
ia-wiki-user-edits-page   2.1M      5.6M      6.7M      550B      10M       70B       44B       4.8T      88P       2.0T      5691.92   N/A
brock200-3                200       12k       291k      570k      3.2M      12M       4.1M      11M       3.5M      16M       0.02      22.96
brock200-4                200       13k       373k      584k      5.2M      16M       4.3M      8.9M      3.0M      17M       0.02      21.85
brock400-3                400       60k       4.4M      4.5M      184M      372M      63M       84M       28M       251M      0.4       997.15
brock400-4                400       60k       4.4M      4.5M      185M      373M      63M       84M       28M       250M      0.4       1010.26
brock800-1                800       208k      23M       38M       1.3B      4.1B      1.1B      2.4B      801M      4.4B      4.11      N/A
brock800-2                800       208k      23M       38M       1.3B      4.2B      1.1B      2.4B      794M      4.4B      4.15      N/A
brock800-3                800       207k      23M       38M       1.3B      4.1B      1.1B      2.4B      802M      4.4B      4.1       N/A

N/A: timed out after 30 hours of runtime
Fig. 3. The empirical runtime of our exact graphlet counting (Alg. 2) in social and information
networks scales almost linearly with the network dimension.


5.2. Scaling 

We used a 2-processor Intel Xeon E5-2687W 3.10 GHz server; each processor has
8 cores, and each core can run two threads. The two processors share 20MB of L3
cache and 256GB of memory. We evaluate the speedup of our parallel algorithm
(i.e., how much faster our proposed algorithm is when we increase the number
of cores), and we used the OpenMP library for multi-core parallelization. In the
following plots, we show the speedups versus the number of processing units
(cores). All speedups are computed relative to the runtime of Algorithm 2 with
one processor. To reduce variance, all experiments are repeated 5 times
and averaged. Figures 4-5 show the speedup plots for a variety of graphs. We
discuss a few observations from the plots presented here.

The first and most important observation that we make is that we obtain
significant speedups from the parallel implementation of Algorithm 2. Figures 4-5
show strong scaling results for a variety of graphs from social, web, and
technological domains. Algorithm 2 scales to 16 cores and yields a 10- to 15-fold
speedup. For example, as shown in Figure 4, we achieve almost linear scaling for the
socfb-Penn94 graph (15-fold speedup for 16 cores).

The second observation links the performance of Algorithm 2 to the charac¬ 
teristics of the graphs. We observe the most significant speedups for social and 
Facebook networks (see Figure 4). We obtain near linear speedup as we increase 
the number of cores. Social networks are computationally intensive relative to 
the other graphs. This is due to their clustering characteristics and the existence 
of a large number of small communities (i.e., triangles, cliques, and cycles) in 
social networks. 

The third observation we make is related to the optimal number of jobs
to dynamically assign to each processing unit when more work is requested (i.e.,
the batch size b), that is, the value of b for which the best performance is achieved
when b jobs are assigned as one batch. Overall, we observed small performance fluctuations
and found the optimal value of b as we varied it between 1 and 256 edges.
Interestingly, this observation is largely true only for sparse graphs,
Fig. 4. Strong scaling results for Facebook and social networks.



Fig. 5. Strong scaling results for interaction, collaboration, technological, and web networks. 


whereas graphs that are relatively dense (e.g., DIMACS graphs) work better when
b is small (e.g., even as small as b = 1). This is likely due to the properties of these
graphs and the auto-optimizer that we built into the library, which automatically
adapts the implementation of the algorithms to use additional data structures
and achieve better performance for those relatively dense graphs at the cost of
using additional space. Thus, our auto-optimizer appropriately balances the time
and space trade-offs.

Note that the results for the job size experiments use degree for ordering 
the neighbors of each node in the succinct graph representation as well as for 
ordering the edge jobs to solve. In both cases, the ordering is from largest to 
smallest. 
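
The batching scheme described above can be sketched as follows. This is a toy illustration only: the paper's implementation uses OpenMP, whereas this sketch uses Python threads (which do not give real speedups for CPU-bound work), counts triangles as a stand-in for the per-edge job, and assumes `deg(u) + deg(v)` as the key for ordering edge jobs. The function names and the default values of `b` and `workers` are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def batches(items, b):
    """Split a list into consecutive batches of at most b items."""
    return [items[i:i + b] for i in range(0, len(items), b)]


def triangles_in_parallel(adj, b=64, workers=4):
    """Count triangles by handing out degree-ordered edge jobs in batches."""
    edges = [(u, v) for u in adj for v in adj[u] if u < v]
    # Largest-to-smallest ordering; the edge key deg(u) + deg(v) is an
    # assumption made for this sketch.
    edges.sort(key=lambda e: len(adj[e[0]]) + len(adj[e[1]]), reverse=True)

    def run(batch):
        # Per-edge job: number of triangles incident to the edge.
        return sum(len(adj[u] & adj[v]) for u, v in batch)

    with ThreadPoolExecutor(max_workers=workers) as pool:
        per_edge_total = sum(pool.map(run, batches(edges, b)))
    return per_edge_total // 3      # every triangle is seen from its 3 edges


k4 = {i: {j for j in range(4) if j != i} for i in range(4)}
```

A small `b` gives finer-grained load balancing at the cost of more scheduling overhead, which is the trade-off the batch-size experiment explores.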


6. Applications 

We also show some applications that could benefit from our fast graphlet counting
algorithm (Algorithm 2), which facilitates exploring and understanding networks
and their structure. Graphlets provide an intuitive and meaningful characterization
of a network at the global macro-level as well as the local micro-level.






Fig. 6. Facebook social networks of California Universities (x-axis: Graphlets). Using the space
of graphlets of size k = 4, Caltech is noticeably different than others, which is consistent with
the findings in (Traud et al., 2012).


Thus, they are useful for numerous applications. At the macro-level, graphlets are 
useful for finding similar networks (graph similarity queries), finding networks 
that disagree most with a given set (graph anomalies), or exploring a time-series 
of networks, among numerous other possibilities. Graphlets are also extremely 
useful for characterizing networks and their behavior at the local node/edge-level, 
also known as the micro-level. For instance, given an edge (u, v) ∈ E, one can 
find the top-k most similar edges (with applications in security, role discovery, 
entity resolution, link prediction, and other related matching/similarity applications). 
Graphlets can also be used for ranking nodes/edges to find unique 
patterns and anomalies such as large stars, cliques, etc. 
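
As an illustration of the top-k edge similarity query mentioned above, the following is a minimal sketch. It assumes each edge is already described by a vector of graphlet counts and uses Euclidean distance as the similarity measure. The function name, the feature layout, and the toy counts are all hypothetical and are not taken from the paper.

```python
import heapq
import math


def top_k_similar_edges(edge_features, query, k):
    """Return the k edges whose graphlet-count vectors are closest to the query edge.

    edge_features maps an edge (u, v) to a tuple of graphlet counts.
    Euclidean distance is an assumption made for this sketch.
    """
    q = edge_features[query]

    def distance(edge):
        return math.dist(q, edge_features[edge])

    candidates = (e for e in edge_features if e != query)
    return heapq.nsmallest(k, candidates, key=distance)


# Hypothetical per-edge counts, e.g. (triangles, 2-stars, 4-cliques).
edge_features = {
    ("a", "b"): (3, 5, 1),
    ("a", "c"): (3, 4, 1),
    ("b", "d"): (0, 9, 0),
    ("c", "d"): (1, 5, 1),
}
print(top_k_similar_edges(edge_features, ("a", "b"), 2))
```

In practice the counts would likely be normalized or log-scaled first, since raw graphlet counts can differ by orders of magnitude between edges.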


6.1. Large-Scale Graph Comparison & Classification 

Graphlets are also useful for large-scale comparison and classification of graphs. 
In this case, we relax the notion of equivalence and isomorphism to some form of 
structural similarity between graphs, such that the graph similarity is measured 
using feature-based graphlet counts. In this section, we show how graphlets could 
be useful for network analysis, anomaly detection, and graph classification. 

First, we study the full data set of Facebook100, which contains 100 Facebook 
networks that represent a variety of US schools (Traud et al., 2012). We 
plot the GFD (i.e., graphlet frequency distribution) score pictorially in Figure 6 
for all California schools. The GFD score is simply the normalized frequencies 
of graphlets of size k (Przulj et al., 2004). In our case, we use k = 4. The figure 
shows that Caltech is noticeably different from the others, consistent with the results 
in (Traud et al., 2012), which show that Caltech is well-known to be organized 
almost exclusively according to its undergraduate "Housing" residence system, 
in contrast to other schools that follow the predominant "dormitory" residence 
system. The residence system seems to impact the organization of the social 
community structures at Caltech. 
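
To make the GFD score concrete, here is a minimal sketch that normalizes raw graphlet counts into frequencies, as described above. The counts are made up for illustration, and the L1 distance used to compare two networks is an assumption of this sketch rather than a measure specified in the text.

```python
def gfd(counts):
    """Normalize raw graphlet counts so that the frequencies sum to 1."""
    total = sum(counts.values())
    if total == 0:
        return {name: 0.0 for name in counts}
    return {name: c / total for name, c in counts.items()}


def gfd_distance(p, q):
    """L1 distance between two GFD vectors defined over the same graphlets."""
    return sum(abs(p[name] - q[name]) for name in p)


# Hypothetical counts for two networks (not taken from the paper).
school_a = gfd({"4-clique": 10, "4-path": 60, "3-star": 30})
school_b = gfd({"4-clique": 40, "4-path": 40, "3-star": 20})

print(school_a)
print(gfd_distance(school_a, school_b))
```

A network whose GFD vector is far from all the others under such a distance would stand out in the same way Caltech does in Figure 6.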

Table 3. Accuracy & Standard Error for Classification of Large Collection of Biological & 
Chemical Graphs. We used counts of all graphlets of size k = {2,3,4} as features. 

Graph   Type       No. Graphs  Accuracy (%)   Total Time (sec)  Avg Time per Graph (sec)
D&D     Protein    1178        76.13 ± 0.03   1.05              8.95 x 10^-4
MUTAG   Chemicals  188         86.4 ± 0.21    0.14              7.47 x 10^-4


Second, we use counts of graphlets of size k = {2,3,4} nodes as features to 
represent each graph in a large collection of graphs. Using the graphlet feature 
representation, we learn a model to predict the unknown label of each 
unlabeled graph (e.g., the label could be the function of protein graphs). We 
test our approach on protein graphs (D&D collection of 1178 protein graphs) 
and chemical compound graphs (MUTAG collection of 188 chemical compound 
graphs) (Vishwanathan et al., 2010). We extract the graphlet features using 
Algorithm 2. Then, we learn a model using an SVM (RBF kernel), and we use 10-fold 
cross-validation for evaluation. Table 3 shows that the accuracy of this approach is 76% 
for protein function prediction and 86% for mutagenic effect prediction. Note 
that by using all graphlet-based features up to size 4 nodes, we were able to 
obtain better accuracy than previous work (which achieved a maximum of 75% and 
83% accuracy for D&D and MUTAG, respectively (Shervashidze et al., 2009)). 
Moreover, Algorithm 2 extracts all the features (graphlet counts) in roughly one 
second. This is a significant improvement over the graphlet feature extraction 
approach proposed in (Shervashidze et al., 2009), which takes 2.45 
hours to extract graphlet features from the D&D collection. 

Third, we compute graphlet counts on a 2 billion edge social network called 
Friendster. Friendster is an online gaming network. Before re-launching as a 
game website in 2011, Friendster was an online social network where users could 
form friendship links with each other. This data was provided by The Web Archive 
Project before the death of the social network. In these experiments, we use the 
induced subgraph of the nodes that either belong to at least one community 
or are connected to other nodes that belong to at least one community. Table 2 
shows a significantly larger number of 4-paths (chains of 4 connected nodes) and 
3-stars than of 4-cliques and triangles. Although the induced 
subgraph that we used from Friendster is clearly biased toward communities, 
the patterns that represent communities, such as cliques and triangles, are less 
likely in the induced graph. For example, the frequency of 4-path patterns is 
0.58, while the frequency of 4-cliques is 0.000014. These results suggest that 
something went wrong in the social network. Previous work on the autopsy 
of Friendster showed that there was a collapse in the community structure of 
Friendster, with a cascade of user departures due to bad decisions in the design and 
interface changes. In a similar fashion, the low frequency of community-related 
graphlets (e.g., cliques) in Friendster also indicates the collapse of the social 
network. 


6.2. Finding Large Stars, Cliques, and other Patterns Fast 

How can we quickly and efficiently find large cliques, stars, and other unique 
patterns? Further, how can we identify the top-k largest cliques, stars, etc.? Note 
that many of these problems are NP-hard, e.g., finding the clique of maximum 
size is a well-known NP-hard problem (Gross et al., 2013). To answer these 

Fig. 7. Visualization of the human diseasome network: A network of disorders 
and disease genes linked by known disorder-gene associations (Goh et al., 2007). Edges 
are weighted/colored by their number of incident star graphlets of size 4 nodes; nodes are 
weighted/colored by their triangle counts. The large star on the right denoted by light blue 
color corresponds to colon cancer; the large star on the lower left denoted by lime green color 
corresponds to deafness; and the large star on the right denoted by lime green color corresponds 
to leukemia. Notably, this figure highlights that a few phenotypes (such as colon cancer, leukemia, 
and deafness) correspond to hubs (large stars) that are connected to a large number of distinct 
disorders, which is consistent with (Goh et al., 2007). 


and other related queries, we leverage the proposed parallel graphlet counting 
method in Algorithm 2. The idea is illustrated in Figure 7, which provides 
a visualization of the human diseasome network (Goh et al., 2007), where we 
used Algorithm 2 to rank (weight) all the edges in the network by the number of star 
patterns of size 4 nodes. The intuition behind the method is that if an edge (or 
node) has a (relatively) large number of stars of 4 nodes (cliques, or another 
graphlet of interest), then it is also likely to be part of a star of a large size. 
Recall that removing a node from a k-star or k-clique forms a star or clique of 
size k-1 (Gross et al., 2013). Accordingly, edges with large weights are likely to 
be members of large stars. Thus, as shown in Figure 7, a visualization based on 
our fast graphlet counting method can help to quickly highlight such large stars 
by using the counts (of stars of size 4 nodes) as edge weights or colors. Notably, 

Fig. 8. Visualization of the terroristRel network: A network of terrorists and their 
relationships. Terrorists are linked to each other if they contact each other, use the same 
facility, are members of the same family, or belong to the same terrorist organization. Edges 
and nodes are weighted/colored by their number of incident cliques of size 4 nodes. Notably, 
this figure highlights how the structure of terrorist networks decomposes into various clique 
patterns (terrorist organizations) and how these cliques are interconnected. The figure highlights 
the largest clique on the top left, denoted by dark blue. 


Figure 7 highlights that a few phenotypes (such as colon cancer, leukemia, and 
deafness) correspond to hubs (large stars) that are connected to a large number 
of distinct disorders, which is consistent with the findings in (Goh et al., 2007). 

Note that the same approach is also applicable for finding cliques and other 
interesting patterns, since edges with a high number of 4-cliques are likely to be 
members of the largest clique in the network. Figure 8 shows how we can find 
large cliques in the terroristRel data (Zhao, Sen and Getoor, 2006). 
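
The edge-ranking idea in this section can be sketched as follows. This is a brute-force illustration, not Algorithm 2: for each edge it counts the induced 4-node stars that contain the edge and then sorts edges by that count. The helper names and the toy graph are hypothetical.

```python
from itertools import combinations


def build_adjacency(edges):
    """Build an undirected adjacency map from a list of (u, v) pairs."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj


def stars_centered_at(adj, center, leaf):
    """Count induced 4-node stars with this center that use edge (center, leaf)."""
    # The two other leaves must be neighbors of the center, not adjacent to
    # the given leaf, and not adjacent to each other.
    others = [w for w in adj[center] if w != leaf and w not in adj[leaf]]
    return sum(1 for a, b in combinations(others, 2) if b not in adj[a])


def rank_edges_by_stars(edges):
    """Return (edge, star count) pairs sorted from largest to smallest count."""
    adj = build_adjacency(edges)
    weights = {
        (u, v): stars_centered_at(adj, u, v) + stars_centered_at(adj, v, u)
        for u, v in edges
    }
    return sorted(weights.items(), key=lambda item: item[1], reverse=True)


# Toy graph: a hub "h" with four leaves, plus one pendant edge (d, e).
edges = [("h", "a"), ("h", "b"), ("h", "c"), ("h", "d"), ("d", "e")]
for edge, count in rank_edges_by_stars(edges):
    print(edge, count)
```

In the toy graph the four hub edges receive the highest weight, which mirrors how the large stars stand out in Figure 7. The same pattern applies to cliques by swapping in a per-edge 4-clique count.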

6.3. Real-time Visual Graphlet Mining 

Visual analytics is the science of analytical reasoning facilitated by interactive 
visual interfaces (Thomas and Cook, 2005). This work develops an interactive 
visual graph analytics platform based on the proposed fast graphlet decomposition 
algorithm. In particular, we integrate interactive visualization with our 
state-of-the-art parallel graphlet decomposition algorithm in order to support 
discovery, analysis, and exploration of such data in real-time. 

We utilize this multi-level graphlet analysis engine, which uses graphlets as a 
basis for exploring, analyzing, and understanding complex networks interactively 
in real-time. We also highlight other key aspects including filtering, querying, 
ranking, manipulating, and a variety of multi-level network analysis and statistical 
techniques based on graphlets. 

Notably, our proposed algorithm is shown to be fast and efficient for real-time 
interactive exploration and mining of graphlets. We expect this tool to be 
extremely useful to biologists and others interested in understanding biological 
networks (protein, brain networks, etc.) as well as chemical networks. 

Fig. 9. Exploration of the brain neural network of C. elegans (Watts and Strogatz, 1998) 
using our interactive graphlet visual analytics tool. Nodes are colored by their k-core number 
and weighted by betweenness, whereas the links are colored by eccentricity. 

There are a number of important and unique challenges in designing methods 
for interactive exploration and mining of graphlets in real-time. In particular, the 
real-time requirement of such a system calls for fast parallel methods to achieve 
real-time interactive rates (e.g., with response times within microseconds or less). 
To this end, we derived dynamic update methods that are localized, that is, the 
update methods leverage the local structure of the graph for efficiently updating 
the counts when nodes/edges are selected, inserted, removed, etc. Thus, given a 
single node or edge, the method updates only the graphlet counts affected by that 
node or edge (as opposed to recomputing the full graphlet decomposition). 
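
The localized update idea can be illustrated for the simplest graphlet, the triangle. The sketch below is an assumption-laden example rather than the update method used in the tool: inserting or deleting an edge (u, v) changes the global triangle count by exactly the number of common neighbors of u and v, so only the two neighborhoods need to be inspected. The class name is hypothetical.

```python
class TriangleCounter:
    """Maintains a global triangle count under edge insertions and deletions."""

    def __init__(self):
        self.adj = {}       # node -> set of neighbors
        self.triangles = 0  # current number of triangles in the graph

    def add_edge(self, u, v):
        nu = self.adj.setdefault(u, set())
        nv = self.adj.setdefault(v, set())
        if u == v or v in nu:
            return  # ignore self-loops and duplicate edges
        # Every common neighbor w closes one new triangle (u, v, w).
        self.triangles += len(nu & nv)
        nu.add(v)
        nv.add(u)

    def remove_edge(self, u, v):
        nu = self.adj.get(u, set())
        nv = self.adj.get(v, set())
        if v not in nu:
            return  # edge not present
        nu.discard(v)
        nv.discard(u)
        # Every common neighbor w loses the triangle (u, v, w).
        self.triangles -= len(nu & nv)


counter = TriangleCounter()
for u, v in [("a", "b"), ("b", "c"), ("a", "c"), ("a", "d"), ("b", "d")]:
    counter.add_edge(u, v)
print(counter.triangles)  # triangles a-b-c and a-b-d
counter.remove_edge("a", "b")
print(counter.triangles)  # the remaining 4-cycle has no triangles
```

Each update costs time proportional to the smaller of the two neighborhoods rather than to the size of the whole graph, which is the property that makes interactive rates plausible. Updates for 4-node graphlets would follow the same local principle but need more bookkeeping.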

Figure 9 uses the interactive graphlet mining tool for real-time exploration 
of the brain neural network from C. elegans (Watts and Strogatz, 1998). Additionally, 
the tool is also useful for exploring many other types of networks, e.g., 
a terrorist relationship network is shown in Figure 10, whereas Figure 11 uses 
graphlets as a basis for understanding and characterizing the communities and 
their structure. As an aside, the graph in Figure 11 is generated using the block 
Chung-Lu graph model. Thus, it is straightforward to see how graphlets can be 
used to characterize synthetic graph generators and to evaluate their utility 
(e.g., whether the synthetic graph preserves the distribution of graphlets observed in a 
real-world network). 

The visual graphlet analytics tool is designed for rapid interactive visual 
exploration and graph mining (Figures 9-11). Graphlets are computed on-the-fly 
upon a simple drag-and-drop of a graph file into the web browser. Additionally, 
the graphlet counts are updated efficiently after each selection, insertion, 
deletion, or change to the graph data. Furthermore, the tool is designed to be 
consistent with the way humans learn via immediate feedback upon every user 
interaction (e.g., change of a slider for filtering) (Ahlberg, Williamson and 
Shneiderman, 1992; Thomas and Cook, 2005). Users have rapid, incremental, and 
reversible control over all graph queries with immediate and continuous visual 
feedback. 

Fig. 10. Illustration of the graphlet methods for real-time interactive graphlet analysis. This 
demonstrates the efficiency and effectiveness of the proposed methods for interactive real-time 
graphlet computations. In the screenshot above, the user selects a subgraph to interactively 
analyze via direct manipulation of the visualization using the mouse. That is, the user adjusts 
the rectangular region above to highlight the subgraphs to analyze. The graphlet statistics are 
updated each time a node/edge is added or removed from the rectangular region used to select 
the subgraph to explore via graphlets. Thus, the user can see how the graphlet statistics change 
as nodes and edges are added (or removed) from the user-specified rectangular region (which in 
turn indicates the nodes and edges to include in the analysis). Note that we leverage localized 
graphlet update methods to achieve the performance required for real-time interactive graphlet 
mining and sense-making. 


7. Conclusion & Future Work 

In this paper, we proposed a fast, efficient, and parallel algorithm for counting 
graphlets of size k = {3,4} nodes that takes only a fraction of the time to compute 
when compared with the current methods used. The proposed graphlet counting 
algorithm leverages a number of proven combinatorial arguments for different 
graphlets. For each edge, we count a few graphlets, and with these counts along 
with the combinatorial arguments, we obtain the exact counts of others in constant 
time. We systematically investigated the scalability of our algorithm on a 
large collection of 300+ networks from a variety of domains. In future work, we 
aim to extend our proposed algorithm to higher-order graphlets. 

Fig. 11. Interactive graphlet exploration of community structure via direct manipulation and 
selection of the visual representation. 